<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://www.layonsan.com/feed.xml" rel="self" type="application/atom+xml" /><link href="http://www.layonsan.com/" rel="alternate" type="text/html" hreflang="en" /><updated>2025-09-10T09:25:31+00:00</updated><id>http://www.layonsan.com/feed.xml</id><title type="html">layonsan</title><subtitle>My blog</subtitle><author><name>Blog Author</name></author><entry><title type="html">Finetuning LLMs using Federated Learning</title><link href="http://www.layonsan.com/finetuning_llm_using_fl/" rel="alternate" type="text/html" title="Finetuning LLMs using Federated Learning" /><published>2025-07-24T00:00:00+00:00</published><updated>2025-07-24T00:00:00+00:00</updated><id>http://www.layonsan.com/finetuning_llm_using_fl</id><content type="html" xml:base="http://www.layonsan.com/finetuning_llm_using_fl/"><![CDATA[<p>My capstone project for my master’s in data science centered on fine-tuning large language models (LLMs) using Federated Learning (FL). I explored using the Flower framework to fine-tune LLMs on a finance dataset via FL, a privacy-preserving training paradigm in which multiple parties collaboratively train a model under the coordination of a central server. A pre-trained LLM available on Hugging Face serves as the base model, with instruction tuning as the representative training procedure. Training with FL proceeds through four iterative steps: (1) global model updating (server), (2) local model training (client), (3) local model updating (client) and (4) global model aggregating.</p>

<h2 id="overall-methodology">Overall Methodology</h2>
<p><img src="/assets/images/finetuning-llm-using-fl/overall_workflow.png" alt="Overall Methodology Diagram" style="display:block; margin-left:auto; margin-right:auto" /></p>

<h2 id="data">Data</h2>

<p>The dataset is a financial instruction dataset sourced from Hugging Face, specifically the <a href="https://huggingface.co/datasets/4DR1455/finance_questions">4DR1455/finance_questions</a> collection. It comprises 53,837 records covering a variety of financial queries and responses, which makes it suitable for fine-tuning LLMs with federated learning to understand and generate finance-related content.</p>

<h2 id="frameworks-for-federated-llms">Frameworks for Federated LLMs</h2>

<p>There are several emerging frameworks designed to support federated fine-tuning of large language models:</p>

<ul>
  <li>OpenFedLLM: Provides a concise framework for federated instruction tuning and federated value alignment, with support for multiple domains (e.g., finance, education) and techniques like LoRA for parameter-efficient fine-tuning.</li>
  <li>FederatedScope-LLM (FS-LLM): An extension of the FederatedScope platform with modules for benchmarks, algorithms, and training workflows, making it easier to evaluate and experiment with federated LLMs.</li>
</ul>

<p>Both are exciting contributions, but they’re still very new and face challenges like limited community support, unclear backwards compatibility, and adoption barriers in real-world applications.</p>

<h3 id="why-i-used-flower">Why I Used Flower</h3>

<p>I chose to build on <a href="https://flower.ai/">Flower</a>, an open-source federated learning framework that has gained stronger traction and community adoption.</p>

<ul>
  <li>Proven foundation: Flower focuses on federated learning and privacy-enhancing technologies, with practical use cases demonstrated in both academia and industry.</li>
  <li>Community &amp; support: Unlike newer frameworks, Flower already has an active developer community, more robust documentation, and backing from venture funding (Felicis Ventures), giving it momentum for long-term sustainability.</li>
  <li>Origins: Flower started as a research project at the University of Cambridge and later evolved into Flower Labs, an AI startup.</li>
  <li>Practical relevance: Within the federated learning landscape, Flower is increasingly used in real-world implementations, making it a reliable choice for experimentation with federated fine-tuning of LLMs.</li>
</ul>

<h2 id="federated-learning-strategies-used">Federated Learning Strategies Used</h2>

<p>I experimented with five different federated learning strategies to fine-tune large language models (LLMs). Each strategy tackles the challenge of training models across distributed, non-shared datasets in slightly different ways:</p>

<ol>
  <li>
    <p>FedAvg (Federated Averaging)
The classic baseline in federated learning. Each client trains locally, and then the server averages the updates, weighted by data size. It’s simple and communication-efficient, but struggles when client data is very different (non-IID).</p>
  </li>
  <li>
    <p>FedProx
An improvement over FedAvg. It adds a “proximal term” during training to keep local updates closer to the global model. This helps reduce instability when client datasets vary a lot.</p>
  </li>
  <li>
    <p>FedAdam
Brings the popular Adam optimizer into federated learning. Instead of just averaging updates, it adapts learning rates and uses momentum to speed up and stabilize training—especially useful when data is inconsistent across clients.</p>
  </li>
  <li>
    <p>FedAdaGrad
Adapts the AdaGrad optimizer for federated learning. The server treats the aggregated client update as a pseudo-gradient and scales its step for each parameter based on the accumulated squared updates, so parameters that have already received large updates take smaller steps. It is designed for settings where clients have very different data characteristics.</p>
  </li>
  <li>
    <p>FedAvgM (FedAvg with Momentum)
Adds a momentum term to FedAvg. Rather than averaging updates naively, it considers past updates too, which smooths training and reduces oscillations. This makes it more stable in challenging federated setups.</p>
  </li>
</ol>
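<p>To make the aggregation step concrete, here is a minimal pure-Python sketch of the FedAvg weighted average and a FedAvgM-style server update. It is an illustration only, not the Flower implementation used in the project: the function names <code class="language-plaintext highlighter-rouge">fedavg</code> and <code class="language-plaintext highlighter-rouge">fedavgm_step</code>, the flat list-of-floats model representation, and the default <code class="language-plaintext highlighter-rouge">server_lr</code> and <code class="language-plaintext highlighter-rouge">momentum</code> values are all assumptions made for this example.</p>

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def fedavg(client_weights: Sequence[Vector], num_examples: Sequence[int]) -> Vector:
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(num_examples)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, num_examples)) / total
        for i in range(dim)
    ]


def fedavgm_step(
    global_weights: Vector,
    aggregated: Vector,
    velocity: Vector,
    server_lr: float = 1.0,   # hypothetical default
    momentum: float = 0.9,    # hypothetical default
) -> Tuple[Vector, Vector]:
    """One server update with momentum applied to the averaged client result."""
    # Pseudo-gradient: difference between the old global model and the new average.
    delta = [g - a for g, a in zip(global_weights, aggregated)]
    # The velocity accumulates past pseudo-gradients, which smooths the updates.
    new_velocity = [momentum * v + d for v, d in zip(velocity, delta)]
    new_global = [g - server_lr * v for g, v in zip(global_weights, new_velocity)]
    return new_global, new_velocity


# Two hypothetical clients holding 1 and 3 examples respectively.
averaged = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])
updated, velocity = fedavgm_step([0.0, 0.0], averaged, [0.0, 0.0])
```

<p>With <code class="language-plaintext highlighter-rouge">momentum=0.0</code> and <code class="language-plaintext highlighter-rouge">server_lr=1.0</code>, the server update in this sketch reduces to plain FedAvg, which is one way to see FedAvgM as a generalization of the baseline.</p>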

<h2 id="federated-instructed-tuning">Federated Instruction Tuning</h2>
<p>To make the project feasible within limited time and compute resources, I used smaller T5 models instead of larger LLMs. While compact, T5 still captures many of the behaviors of bigger models, making it a practical stand-in (proxy model) for testing federated learning in financial instruction tasks. This approach demonstrates that meaningful research can still be done responsibly, without requiring massive infrastructure.</p>

<h2 id="training-setup">Training Setup</h2>

<p>To ensure a fair comparison between the baseline (centralized training) and federated learning, both were trained for a total of 9 epochs:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Baseline (centralized): 9 epochs straight
Federated models: 3 epochs × 3 rounds = 9 epochs total
Key training settings:
Batch size: 64 (with auto-adjust to fit hardware)
Optimizer: Adafactor + cosine learning rate scheduler
Learning rate: 2e-5
Weight decay: 0.01 (to reduce overfitting)
Max sequence length: 512 tokens
Precision: bfloat16 (bf16) for efficiency
Packing: Enabled (to improve efficiency with variable-length inputs)
Evaluation &amp; logging: Every 50 steps (tracked with Weights &amp; Biases)
Hardware: Google Colab A100 GPU, optimized with 10 dataloader workers
Checkpoints: Saved in safetensors format for compatibility and security
</code></pre></div></div>
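<p>As a rough illustration of two of the settings above, the sketch below shows how the epoch budget lines up between the two regimes and what a cosine learning rate schedule looks like. It assumes a plain cosine decay from the peak rate of 2e-5 down to zero with no warmup; the post does not state the warmup or the total step count, so <code class="language-plaintext highlighter-rouge">cosine_lr</code> and the step count of 1000 are hypothetical.</p>

```python
import math

# Epoch budget: federated training runs local epochs inside each round.
LOCAL_EPOCHS = 3
ROUNDS = 3
federated_total_epochs = LOCAL_EPOCHS * ROUNDS  # same budget as the 9-epoch baseline


def cosine_lr(step: int, total_steps: int, peak_lr: float = 2e-5) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, middle and end of a hypothetical 1000-step run.
schedule = [cosine_lr(step, 1000) for step in (0, 500, 1000)]
```

<p>The rate starts at the peak, falls to half of it at the midpoint, and reaches zero at the final step.</p>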

<p>This setup was tuned to strike a balance between efficiency, performance, and generalization on the financial instruction dataset.</p>

<h2 id="evaluation-metrics">Evaluation Metrics</h2>

<p>Since the models generate text, I evaluated them using ROUGE (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-SUM) and BLEU. These metrics measure how closely the model’s output matches the reference answers, focusing on overlap of words, phrases, and sequences.</p>
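<p>For intuition about what these overlap metrics measure, here is a simplified ROUGE-N sketch. It assumes lowercase whitespace tokenization with no stemming, so its numbers will not exactly match an established ROUGE package; the function names and the example sentences are made up for illustration.</p>

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference: str, candidate: str, n: int = 1) -> Dict[str, float]:
    """Simplified ROUGE-N: clipped n-gram overlap between candidate and reference."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Each reference n-gram can be matched at most as often as it appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_n("the bond yield rose", "the yield rose sharply", n=1)
```

<p>In this example three of the four reference words appear in the candidate, so unigram recall and precision are both 0.75. Only one bigram ("yield rose") is shared, which is why ROUGE-2 is typically much lower than ROUGE-1.</p>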

<h2 id="results">Results</h2>

<h3 id="finetuning-training-evaluation">Finetuning Training Evaluation</h3>

<p>The experiments revealed a clear performance hierarchy:</p>

<ul>
  <li>Baseline (centralized training): As expected, this model performed best across all metrics. It served as the benchmark for comparison.</li>
  <li>FedAvgM (with momentum): Consistently the top-performing federated strategy. It achieved the lowest loss (0.160) and the highest ROUGE scores, outperforming FedProx (0.175 loss), FedAvg (0.182), and FedAdaGrad (0.216).</li>
  <li>FedAdam: Surprisingly, this strategy lagged far behind the others. Its performance dropped significantly across both loss and ROUGE metrics.</li>
</ul>

<p>In short: while federated models still trailed the centralized baseline, FedAvgM showed strong promise as a practical strategy for stabilizing training and improving text generation quality in federated setups.</p>

<p><img src="/assets/images/finetuning-llm-using-fl/training_evaluation.png" alt="Training Evaluation" style="display:block; margin-left:auto; margin-right:auto" /></p>

<h3 id="predictions-evaluation">Predictions Evaluation</h3>

<p>When evaluating predictions with ROUGE metrics, the results aligned closely with the training phase but revealed some interesting nuances:</p>

<ul>
  <li>Baseline (centralized training): Maintained its lead, confirming its role as the strongest benchmark.</li>
  <li>FedAvgM, FedProx, and FedAvg: Performed very similarly during prediction, with ROUGE-SUM scores clustered at 0.046 (FedAvgM), 0.045 (FedProx), and 0.046 (FedAvg). This suggests that, in practice, their performance differences are marginal.</li>
  <li>FedAdaGrad: While competitive during training, its performance dipped slightly in prediction evaluation, falling behind the leading strategies.</li>
  <li>FedAdam: Consistently underperformed, with a ROUGE-SUM score of just 0.003, far below all other strategies.</li>
</ul>

<p><img src="/assets/images/finetuning-llm-using-fl/predict_evaluation.png" alt="Predict Evaluation" style="display:block; margin-left:auto; margin-right:auto" /></p>

<h2 id="takeaways">Takeaways</h2>

<p>The experiments highlighted several important findings about federated optimization strategies for fine-tuning language models on financial data:</p>

<ol>
  <li>
    <p>FedAvgM consistently leads
Across the training metrics, FedAvgM outperformed the other federated strategies, and on prediction ROUGE-SUM it matched the best of them. Its use of momentum helped speed up convergence and reduce instability during training, making it especially effective in decentralized settings. This mirrors findings from other research showing that momentum improves stability in distributed optimization.</p>
  </li>
  <li>
    <p>Domain-specific challenges
Because the dataset consisted of finance-related chats and responses, models had to capture precise terminology and context. Federated learning complicates this further, since each client may have slightly different distributions of financial text. FedAvgM’s stability appears to help the model generalize better across these variations while preserving terminological accuracy.</p>
  </li>
  <li>
    <p>Impact of model size
The project used T5-small, which has far fewer parameters than large models like GPT. While this made training feasible, smaller models may be more sensitive to data heterogeneity in federated setups. This may explain why some strategies struggled: a small model simply doesn’t have the capacity to absorb highly diverse updates. The weak performance of FedAdam is consistent with this idea, as Adam-based optimizers can struggle in federated environments with non-IID client data.</p>
  </li>
  <li>
    <p>ROUGE scores matter in finance
Since financial conversations require both accuracy and fluency, ROUGE scores were especially relevant. Higher ROUGE indicates the model captured the right terminology while staying coherent. FedAvgM’s strong ROUGE results suggest it not only reduced loss but also generated more consistent, domain-appropriate responses.</p>
  </li>
  <li>
    <p>FedAdam underperformed
One unexpected finding was how poorly FedAdam performed compared to other strategies. While Adam is a go-to optimizer in centralized deep learning, it doesn’t translate well to federated learning. Its adaptive moment updates seem too unstable when client data is non-IID, reinforcing the need to carefully re-evaluate optimizers before applying them in decentralized contexts.</p>
  </li>
</ol>]]></content><author><name>Blog Author</name></author><category term="Data Science" /><category term="Data Engineering" /><category term="Machine Learning" /><category term="Ml Platform" /><category term="Batch Serving" /><summary type="html"><![CDATA[Portfolio of my work using Azure]]></summary></entry><entry><title type="html">Implementing an end-to-end ML system using batch-serving architecture</title><link href="http://www.layonsan.com/ml_system/" rel="alternate" type="text/html" title="Implementing an end-to-end ML system using batch-serving architecture" /><published>2024-01-16T00:00:00+00:00</published><updated>2024-01-16T00:00:00+00:00</updated><id>http://www.layonsan.com/ml_system</id><content type="html" xml:base="http://www.layonsan.com/ml_system/"><![CDATA[<p>This post walks through the development of an end-to-end machine learning (ML) platform built around a batch-serving architecture. The project is part of my 2023 goal plan, which aims to expand my engineering capabilities into AI/ML deployments. It draws inspiration and insights from <a href="https://towardsdatascience.com/tagged/full-stack-mlops">Paul Iusztin’s comprehensive Full Stack MLOps Guide</a>. Rather than merely duplicating his project, I used a different dataset: drawing on the local context of Singapore, I used the Open Government Application Programming Interface (API) to extract PM2.5 data. Consequently, although the infrastructure stack and logic align closely with the reference guide, the components responsible for preprocessing, prediction, and inference differ notably. The source code can be found in this <a href="https://github.com/leonswl/ml_pm25">GitHub repository</a>.</p>

<p><strong>Overall Architecture</strong>
<img src="/assets/images/ml-system/ml_architecture_drawio.png" alt="Overall Architecture Diagram" style="display:block; margin-left:auto; margin-right:auto" /></p>

<h2 id="1-feature-pipelines">1. Feature Pipelines</h2>

<p>The first component of the ML system is to extract and perform feature engineering on the data before loading the transformed data into a feature store.</p>

<h3 id="11-data">1.1. Data</h3>

<p>I decided to use a real-time API from <a href="https://beta.data.gov.sg/collections/1394/datasets/d_9b2d180c92c4a3c45b5c671937bd1b5d/view">Data Gov</a> as the data source. The API allows us to query hourly PSI readings for various regions in Singapore.</p>

<p>An extraction script pulls the data using an HTTP GET request.</p>

<h3 id="12-feature-engineering">1.2. Feature Engineering</h3>

<p>A fair amount of preprocessing is required to prepare the data as features. The payload schema needs to be flattened and transformed to get the relevant fields: <code class="language-plaintext highlighter-rouge">timestamp</code>, <code class="language-plaintext highlighter-rouge">update_timestamp</code>, <code class="language-plaintext highlighter-rouge">readings_&lt;regions&gt;</code>. The regions are north, south, east, west and central. In addition, a target variable <code class="language-plaintext highlighter-rouge">reading_average</code> is created by averaging the hourly PSI across the regions.</p>

<h3 id="13-hopswork-feature-store">1.3. Hopsworks Feature Store</h3>
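<p>The flattening step can be sketched as follows. The payload shape used here (a dictionary with <code class="language-plaintext highlighter-rouge">timestamp</code>, <code class="language-plaintext highlighter-rouge">update_timestamp</code> and a nested <code class="language-plaintext highlighter-rouge">readings</code> mapping keyed by region) is an assumption for illustration and may not match the real API response, and <code class="language-plaintext highlighter-rouge">flatten_item</code> is a hypothetical helper rather than the code in the repository.</p>

```python
from typing import Any, Dict

REGIONS = ("north", "south", "east", "west", "central")


def flatten_item(item: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one nested reading into a single row with one column per region."""
    readings = item["readings"]
    row = {
        "timestamp": item["timestamp"],
        "update_timestamp": item["update_timestamp"],
    }
    for region in REGIONS:
        row[f"readings_{region}"] = readings[region]
    # Target variable: the mean of the regional readings for this hour.
    values = [readings[region] for region in REGIONS]
    row["reading_average"] = sum(values) / len(values)
    return row


# Hypothetical payload item with made-up values.
sample = {
    "timestamp": "2024-01-01T10:00:00+08:00",
    "update_timestamp": "2024-01-01T10:08:00+08:00",
    "readings": {"north": 55, "south": 52, "east": 58, "west": 50, "central": 60},
}
row = flatten_item(sample)
```

<p>Each flattened row can then be appended to a table and loaded into the feature store.</p>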

<blockquote>
  <p>Hopsworks is a flexible and modular feature store that provides seamless integration for existing pipelines, superior performance for any SLA, and increased productivity for data and AI teams.</p>
</blockquote>

<p>The feature pipelines use APIs to extract the data and perform some feature engineering before loading the results into a feature store (Hopsworks).</p>

<p><img src="/assets/images/ml-system/hopswork_feature_store.png" alt="Hopsworks Feature Store" style="display:block; margin-left:auto; margin-right:auto" /></p>

<h2 id="2-training-pipelines">2. Training Pipelines</h2>

<p>The second component is a series of training pipelines that handle the heavy lifting of model training. Data is first pulled from the feature store, with its metadata logged to wandb. The model is then trained, and the output artifacts are uploaded to wandb.</p>

<h3 id="21-model-training">2.1. Model Training</h3>

<p>A baseline model using naive Bayes serves as a benchmark. Next, a more sophisticated model combining sktime and LightGBM is tuned and then trained using the best configuration.</p>

<p>The best model is also loaded into Hopsworks’ model registry.</p>

<h3 id="22-weights--biases-wandb">2.2. Weights &amp; Biases (wandb)</h3>

<blockquote>
  <p>Weights &amp; Biases helps AI developers build better models faster. Quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, and manage your ML workflows end-to-end.</p>
</blockquote>

<p>For each of the runs, we can track the experimental output and performance, as well as the various model metrics.</p>

<p><img src="/assets/images/ml-system/wandb_image_forecast.png" alt="Forecast" style="display:block; margin-left:auto; margin-right:auto" />
<img src="/assets/images/ml-system/wandb_image_test.png" alt="Test" style="display:block; margin-left:auto; margin-right:auto" /></p>

<h2 id="3-batch-prediction-pipelines">3. Batch Prediction Pipelines</h2>

<p>The third component, batch prediction, is relatively straightforward. Data is retrieved in batches from the Hopsworks feature store and passed through model inference to produce predictions, and the outputs are then written to cloud storage, which acts as a cache.</p>

<h3 id="31-google-cloud-storage-gcs">3.1. Google Cloud Storage (GCS)</h3>

<p>Google Cloud Storage (GCS) serves as the repository for data files stored in Parquet format, covering X and y features, predictions, and monitoring data. Although tools like Redis are well suited to caching predictions, adding one would have introduced extra complexity that falls outside the primary scope of this project.</p>

<p>To connect to a GCS bucket from the Python scripts, I create a GCP service account with the appropriate access credentials.</p>

<h3 id="32-batch-prediction">3.2. Batch Prediction</h3>

<p>Each run involves extracting a batch of data within a specified datetime range, streamlining the batch inference process. The most recent and optimal model is loaded into memory by downloading the artifact from the model registry. Subsequently, the model predicts PSI values for the upcoming 24 hours, and these predictions are then stored in the Google Cloud Storage (GCS) bucket.</p>

<h2 id="4-scheduling-and-orchestration-using-airflow">4. Scheduling and Orchestration using Airflow</h2>

<h3 id="41-pypi-server">4.1. PyPI Server</h3>

<p>A PyPI registry is a server that hosts Python packages. Because this one is private, only people with access to the server can install packages from it. The private PyPI server is configured to host the feature, training and batch prediction pipelines.</p>

<p>Poetry is used to package the feature, training and batch prediction pipelines as individual packages before uploading to the server.</p>

<h3 id="42-airflow">4.2. Airflow</h3>

<p>Airflow is used to schedule and orchestrate the pipelines using DAGs. Here’s an overview of how the flow and branching of DAGs are configured in Airflow.
<img src="/assets/images/ml-system/ml_pipeline_dags.png" alt="Airflow ML system DAGs" style="display:block; margin-left:auto; margin-right:auto" /></p>

<h2 id="5-continuous-monitoring-for-model-performance">5. Continuous Monitoring for Model Performance</h2>

<h3 id="51-great-expectation-ge-suite">5.1. Great Expectations (GE) Suite</h3>

<p>A <a href="https://docs.greatexpectations.io/docs/0.15.50/terms/expectation_suite/">GE Suite</a> is a collection of verifiable assertions about data integrity. Hopsworks integrates GE support, so a GE validation suite can be added to Hopsworks to define the expected behavior of new data.</p>

<blockquote>
  <p>An expectation is a verifiable assertion about data</p>
</blockquote>

<p>Several expectations include:</p>
<ul>
  <li>Ensuring that table columns align with a predefined ordered list.</li>
  <li>Verifying that the total number of columns is 7.</li>
  <li>Affirming that timestamp columns cannot be null.</li>
  <li>Specifying that readings columns are of type int32 and possess a minimum and maximum value of 0 and 500, respectively.</li>
</ul>
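<p>To show what these checks amount to, here is a plain-Python sketch of the same four assertions. It deliberately does not use the Great Expectations API, and the column names in <code class="language-plaintext highlighter-rouge">EXPECTED_COLUMNS</code> are assumptions inferred from the feature engineering section, not the exact schema in the feature store. The int32 type check is approximated with Python's built-in <code class="language-plaintext highlighter-rouge">int</code>.</p>

```python
from typing import Dict, List

# Hypothetical ordered schema with 7 columns.
EXPECTED_COLUMNS = [
    "timestamp", "update_timestamp",
    "readings_north", "readings_south", "readings_east",
    "readings_west", "readings_central",
]
READING_RANGE = range(0, 501)  # valid readings: 0 to 500 inclusive


def validate_batch(rows: List[Dict[str, object]]) -> List[str]:
    """Return a list of human-readable failures; an empty list means the batch passed."""
    failures = []
    for index, row in enumerate(rows):
        # Column names, order and count must match the predefined list.
        if list(row.keys()) != EXPECTED_COLUMNS:
            failures.append(f"row {index}: columns do not match the expected ordered list")
            continue
        # Timestamp columns cannot be null.
        for name in ("timestamp", "update_timestamp"):
            if row[name] is None:
                failures.append(f"row {index}: {name} is null")
        # Readings must be integers between 0 and 500.
        for name in EXPECTED_COLUMNS[2:]:
            value = row[name]
            if not isinstance(value, int) or value not in READING_RANGE:
                failures.append(f"row {index}: {name}={value!r} is not an integer in [0, 500]")
    return failures


good_row = {
    "timestamp": "2024-01-01T10:00:00+08:00",
    "update_timestamp": "2024-01-01T10:08:00+08:00",
    "readings_north": 55, "readings_south": 52, "readings_east": 58,
    "readings_west": 50, "readings_central": 60,
}
bad_row = dict(good_row, timestamp=None, readings_west=612)
report = validate_batch([good_row, bad_row])
```

<p>In this sketch the second row fails twice: once for the null timestamp and once for the out-of-range reading. A real suite would attach these rules to the data source so that every new batch is checked before it is written.</p>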

<h3 id="52-ml-monitoring">5.2. ML Monitoring</h3>

<p>Ensuring the consistent and expected performance of the production system over time is crucial. Implementing a machine learning monitoring process establishes a mechanism to address any issues that may arise, facilitating the adaptation of the system and retraining the model in response to changes in the environment.</p>

<p>For instance, the Mean Absolute Percentage Error (MAPE) metric is continuously computed. A spike in this metric serves as an alarm, prompting actions such as fine-tuning the model or adjusting model configurations as necessary.</p>
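<p>A minimal sketch of this monitoring check is shown below. The <code class="language-plaintext highlighter-rouge">mape</code> function follows the standard definition of the metric, skipping hours where the actual value is zero to avoid division by zero; the spike rule in <code class="language-plaintext highlighter-rouge">should_alert</code>, including its <code class="language-plaintext highlighter-rouge">factor</code> of 1.5, is a hypothetical threshold chosen for illustration rather than the rule used in the project.</p>

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error, in percent, ignoring zero actuals."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


def should_alert(history: Sequence[float], latest: float, factor: float = 1.5) -> bool:
    """Flag a spike when the latest MAPE exceeds factor times the historical mean."""
    baseline = sum(history) / len(history)
    return latest > factor * baseline


# Made-up readings: two hours of actual values against the model's predictions.
latest_mape = mape([100, 50], [90, 55])
alert = should_alert([8.0, 9.0, 10.0], latest_mape)
```

<p>Both predictions here are off by 10 percent, so the MAPE is about 10 and no alert fires against a historical mean of 9. An alert would then be the trigger for retraining or reconfiguring the model.</p>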

<h2 id="6-fastapi-and-streamlit">6. FastAPI and Streamlit</h2>

<p>FastAPI and Streamlit serve as the backend and frontend for retrieving model outputs (predictions and monitoring metrics) and rendering them as a dashboard. Both applications are dockerised and deployed.</p>

<h3 id="61-fastapi">6.1. FastAPI</h3>
<p>FastAPI is used as the backend to consume predictions and monitoring metrics from GCS and expose them through a RESTful API. Several endpoints are defined to GET the predictions and monitoring metrics.</p>

<p>Endpoints:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">/health</code>: Health check</li>
  <li><code class="language-plaintext highlighter-rouge">/predictions</code>: GET prediction values</li>
  <li><code class="language-plaintext highlighter-rouge">/monitoring/metrics</code>: GET aggregated monitoring metrics</li>
</ul>

<p>On each request, the API reads the data from storage and validates it against the preconfigured Pydantic schema. The response is then serialized to JSON.</p>

<p><img src="/assets/images/ml-system/fastapi.png" alt="FastAPI docs" /></p>

<h3 id="62-streamlit">6.2. Streamlit</h3>
<p>Streamlit will be the frontend application that renders the data to visualise 2 dashboards:</p>
<ol>
  <li>predictions<br />
<img src="/assets/images/ml-system/app-predictions.png" alt="Prediction Web App" style="display:block; margin-left:auto; margin-right:auto" /></li>
  <li>monitoring metrics<br />
<img src="/assets/images/ml-system/app-monitoring.png" alt="Monitoring Web App" style="display:block; margin-left:auto; margin-right:auto" /></li>
</ol>

<h2 id="7-system-deployment-using-gcp">7. System Deployment using GCP</h2>

<p>Due to cost considerations, I have opted to exclude this section, as it falls outside the project’s defined scope.</p>

<p>In a production environment, the preferred approach involves deploying all machine learning components to a cloud provider (e.g., AWS, GCP, Azure) and establishing a Continuous Integration/Continuous Deployment (CI/CD) pipeline utilizing tools such as GitHub Actions or Azure Pipelines, among others.</p>]]></content><author><name>Blog Author</name></author><category term="Data Science" /><category term="Data Engineering" /><category term="Machine Learning" /><category term="Ml Platform" /><category term="Batch Serving" /><summary type="html"><![CDATA[Portfolio of my work using Azure]]></summary></entry><entry><title type="html">Data Systems using Azure</title><link href="http://www.layonsan.com/Azure/" rel="alternate" type="text/html" title="Data Systems using Azure" /><published>2024-01-04T00:00:00+00:00</published><updated>2024-01-04T00:00:00+00:00</updated><id>http://www.layonsan.com/Azure</id><content type="html" xml:base="http://www.layonsan.com/Azure/"><![CDATA[<p>I have been working with Azure cloud services for the past 1-2 years and have earned two Microsoft Certificates along the way: Azure Fundamentals and Azure Data Engineering. In this post, I will highlight a few pivotal projects where I played a central role. This post is less a guide than a showcase of how these services were integrated to accomplish each project’s specific objectives.</p>

<h2 id="sftp-architecture">SFTP Architecture</h2>

<p><img src="/assets/images/azure-portfolio/sftp_architecture.jpg" alt="SFTP Architecture" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>The primary aim of this project was to establish a data system capable of automating the encrypted file transfer to a commercial partner through SFTP. It was imperative that the transferred data remained encrypted throughout the process.</p>

<ol>
  <li>
    <p>Database (Sink)</p>

    <p>Data essential for batch transfer underwent various transformations for business use within the data warehouse before being loaded onto this server at regular batch intervals. This server acted as the source for retrieving data for our data system.</p>
  </li>
  <li>
    <p>Azure Data Factory (Transformation)</p>

    <p>Azure Data Factory (ADF) served as the central pipeline orchestration tool for executing batch data transfers. ADF played a crucial role in integrating multiple services to fulfill the project’s objectives. Its primary functions encompassed:</p>

    <ul>
      <li>Adapting data to align with the format required by our commercial partner’s existing SFTP server infrastructure.</li>
      <li>Transmitting the finalized, encrypted, and formatted files to the destination SFTP server.</li>
    </ul>
  </li>
  <li>
    <p>Azure Blob Storage (Staging)</p>

    <p>Transformed data was stored as blob files in Azure Blob Storage, functioning as an interim staging area.</p>

    <p>To maintain a clear demarcation between encrypted and unencrypted data, encrypted data was stored separately, facilitating a more transparent debugging process. Staging data twice allowed pinpointing issues from the unencrypted files onward, bypassing the data transformation process.</p>

    <p>Additionally, each batch’s encrypted AES keys were stored within this storage environment.</p>
  </li>
  <li>
    <p>Azure Function (Encryption)</p>

    <p>Adhering to security requirements mandating encryption at rest and in transit, a 2-stage hybrid encryption employing RSA and AES was implemented on the data files themselves.</p>

    <p>While <a href="https://learn.microsoft.com/en-us/azure/security/fundamentals/encryption-overview">Azure</a> ensures encryption at rest and in transit, the intricacies of hybrid encryption demanded a custom solution. Leveraging Azure Function, the encryption logic was managed and deployed using the <a href="https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python?tabs=asgi%2Capplication-level&amp;pivots=python-mode-decorators">Python v2 programming model</a>.</p>
  </li>
  <li>
    <p>Azure Key Vault (Secure keys)</p>

    <p>RSA key certificates were securely stored within the Azure Key Vault. The Azure Function accessed these keys solely during the encryption process, guaranteeing the constant protection and security of the RSA key.</p>
  </li>
</ol>

<h2 id="dbt-architecture">dbt Architecture</h2>

<p><img src="/assets/images/azure-portfolio/dbt_architecture.jpg" alt="dbt Architecture" style="display:block; margin-left:auto; margin-right:auto" /></p>

<p>The project aimed to devise and establish a straightforward architecture supporting the deployment of dbt. This shift was intended to transition away from Alteryx as the primary ETL/ELT tool toward a more adaptable and resilient infrastructure that champions dbt for data transformation.</p>

<ol>
  <li>
    <p>Container Registry (Containerisation)</p>

    <p>Upon containerizing the dbt project into a Docker image, the image is stored within the Container Registry.</p>
  </li>
  <li>
    <p>Container Instance</p>

    <p>The containerized application is deployed to a single instance of Azure Container Instances. Rather than running continuously, the container is started only for dbt runs and stays stopped between them to save cost.</p>
  </li>
  <li>
    <p>Data Factory (Orchestration &amp; Monitoring)</p>

    <p>Azure Data Factory operates as the orchestration tool, responsible for scheduling and executing the containerized dbt application. Triggers are utilized to initiate the container and commence the dbt runs.</p>

    <p>Furthermore, the REST API is leveraged for monitoring the container’s status. This enables efficient tracking of the container’s state.</p>
  </li>
</ol>]]></content><author><name>Blog Author</name></author><category term="Data Science" /><category term="Data Engineering" /><category term="Machine Learning" /><category term="Azure" /><summary type="html"><![CDATA[Portfolio of my work using Azure]]></summary></entry><entry><title type="html">2023 Review</title><link href="http://www.layonsan.com/2023_Review/" rel="alternate" type="text/html" title="2023 Review" /><published>2024-01-03T00:00:00+00:00</published><updated>2024-01-03T00:00:00+00:00</updated><id>http://www.layonsan.com/2023_Review</id><content type="html" xml:base="http://www.layonsan.com/2023_Review/"><![CDATA[<p>This reflection and review for 2023 will incorporate a more personal element instead of focusing on purely work topics.</p>

<h3 id="summary">Summary</h3>

<p>In 2023, my life was a whirlwind, a rollercoaster of experiences that brought both challenges and exhilarating moments across different aspects. Reflecting on this eventful year, I appreciate the highs and acknowledge the lows. While I aspired to achieve numerous goals, I fell short in several areas. However, this has only fueled my determination to make 2024 a year of growth and achievement. I will instead commit myself to focusing on excelling in a few key areas that matter most to me.</p>

<h3 id="school">School</h3>
<p>Academically, diving into my part-time master’s program in Data Science at NTU exposed me to intriguing modules like data systems, machine learning applications, and the mathematics behind AI. Balancing work and academics was tough, but the rewards were immense, leaving me eagerly anticipating more learning and personal growth in 2024.</p>

<h3 id="work">Work</h3>
<p>Professionally, my focus shifted towards data engineering and Azure cloud computing at work. Engaging in various projects expanded my expertise in data modeling, ETL/ELT pipelines, and developing AI/ML Proof of Concepts within the Azure ecosystem. However, the year ended with some professional turbulence, leaving a bittersweet feeling as I reflected on my growth.</p>

<h3 id="personal">Personal</h3>
<p>On a personal note, milestones abounded. My engagement and subsequent wedding planning with my partner, along with our joint venture in buying and renovating a resale flat, marked significant strides in our lives. Despite time constraints limiting my interactions, catching up with close friends and witnessing my best friend’s marriage were treasured moments.</p>

<h3 id="travel">Travel</h3>
<p>Travel-wise, the experiences were diverse and enriching. Beginning the year with an enchanting trip to Germany and Poland with my partner set the tone for an adventurous year. A family trip to Krabi and a rejuvenating Bali getaway with friends were refreshing. The last trip of the year was a bachelor trip to Bangkok, which introduced me to new friendships and unforgettable memories.</p>

<h2 id="improvements">Improvements</h2>
<p>Looking back on my goals from last year:</p>
<blockquote>
  <p>🥅 2023 Goals: Gain a deeper understanding in Causal Inference and engage in more practical application of data/ML engineering and ops.</p>
</blockquote>

<ul>
  <li>
    <p>Causal Inference: Unfortunately, I didn’t make much progress in this area due to the significant influx of events—starting my master’s, the GenAI hype, etc.</p>
  </li>
  <li>
    <p>DataOps and Data Engineering: My work allowed me to make significant progress in this area. I earned two Microsoft certifications, Azure Fundamentals and Azure Data Engineer, and became proficient in various Azure cloud services such as Data Factory, Functions, Logic Apps, Stream Analytics, Event Hubs, Machine Learning, etc.</p>
  </li>
  <li>
    <p>MLOps and ML Engineering: Although slightly delayed, I am currently working on a project that focuses on implementing an end-to-end ML platform.</p>
  </li>
</ul>

<h2 id="2024-plan">2024 Plan</h2>
<ol>
  <li>
    <p>Plan and enjoy a lovely wedding and honeymoon that my wife and I will always remember fondly.</p>
  </li>
  <li>
    <p>Dive deeper into GenAI projects, aligning with my studies to explore the realms of deep learning, neural networks, and reinforcement learning. This will fortify my grasp on generative models, enhancing my expertise in this dynamic field.</p>
  </li>
  <li>
    <p>Expand my reading horizons. Although my exploration of stoicism in 2023 impacted my reading pace, I aim to continue delving into diverse genres while striving to maintain a steady reading habit.</p>
  </li>
  <li>
    <p>Cultivate a consistent writing practice. While I authored two articles in 2022, I didn’t contribute any last year. Writing not only reinforces my understanding but also allows me to share my insights with the community — an endeavor I consider invaluable. I aim to write and share my learnings more frequently in 2024, contributing to the collective growth of knowledge.</p>
  </li>
</ol>

<p><br />
<br /></p>

<p><em>On a side note, it’s possible that this portfolio website is due for an exciting upgrade soon!</em></p>]]></content><author><name>Blog Author</name></author><category term="Personal Life" /><category term="Data Science" /><category term="Portfolio" /><category term="Machine Learning" /><summary type="html"><![CDATA[Summary of my work in 2023]]></summary></entry><entry><title type="html">2022 Project Summary</title><link href="http://www.layonsan.com/2022_Portfolio/" rel="alternate" type="text/html" title="2022 Project Summary" /><published>2022-10-19T00:00:00+00:00</published><updated>2022-10-19T00:00:00+00:00</updated><id>http://www.layonsan.com/2022_Portfolio</id><content type="html" xml:base="http://www.layonsan.com/2022_Portfolio/"><![CDATA[<p>Instead of working on analytical insights projects in 2022, I decided to spin up something different. There are 2 notable projects I have been working on this year: Medium Articles and Churn Models on Streamlit.</p>

<h3 id="1-medium-articles">1. Medium Articles</h3>

<p>I wrote and published 2 articles in <a href="https://towardsdatascience.com/">Towards Data Science (TDS)</a> on <a href="https://medium.com/">Medium</a>:</p>

<p><img src="/assets/img/2022_Portfolio_files/medium_ss.png" alt="screenshot of medium articles" /></p>

<ul>
  <li><a href="https://medium.com/towards-data-science/github-analytical-workflow-for-data-analysts-31a28035b563">GitHub Analytical Workflow for Data Analysts</a></li>
  <li><a href="https://medium.com/towards-data-science/modularise-your-notebook-into-scripts-5d5ccaf3f4f3">Modularise your Notebook into Scripts</a></li>
</ul>

<hr />

<h3 id="2-churn-models-on-streamlit">2. Churn Models on Streamlit</h3>

<p>Another key focus of the year was learning and deploying Streamlit. There are many Python web frameworks for rendering dashboards, such as Dash, Flask and Streamlit. I decided to go with Streamlit as it can turn data scripts into shareable web apps in minutes, all in pure Python, with no front‑end experience required. For the demo topic, I chose churn models to illustrate how supervised classification models such as GLMs and Random Forests can be used to address problems and generate insights.</p>

<p>Feel free to visit my <a href="https://leonswl-churn-model-main-5fs093.streamlitapp.com/">Streamlit app</a> to see how Streamlit can be used to showcase experimental machine learning models that tackle churn prediction problems. To view the source code, you can also check out the GitHub repo <a href="https://github.com/leonswl/churn-model">here</a>.</p>

<p><img src="/assets/img/2022_Portfolio_files/streamlit.gif" alt="streamlit demo gif" /></p>]]></content><author><name>Blog Author</name></author><category term="Data Analytics" /><category term="Portfolio" /><category term="Medium" /><category term="Streamlit" /><summary type="html"><![CDATA[Summary of my work in 2022]]></summary></entry><entry><title type="html">LinkedIn Network Analysis</title><link href="http://www.layonsan.com/linkedin-network-analysis/" rel="alternate" type="text/html" title="LinkedIn Network Analysis" /><published>2021-12-01T00:00:00+00:00</published><updated>2021-12-01T00:00:00+00:00</updated><id>http://www.layonsan.com/linkedin-network-analysis</id><content type="html" xml:base="http://www.layonsan.com/linkedin-network-analysis/"><![CDATA[<p>What does your LinkedIn network really look like? I visualized my own connections using NetworkX and Plotly, turning a list of names into a living, breathing graph. Along the way, I explored concepts from network and graph theory—like centrality, clusters, and bridges—that reveal hidden patterns in how people are connected. Dive in to see how data visualization can turn something familiar into something surprisingly insightful.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Importing libraries
</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">networkx</span> <span class="k">as</span> <span class="n">nx</span>
<span class="kn">from</span> <span class="nn">pyvis</span> <span class="kn">import</span> <span class="n">network</span> <span class="k">as</span> <span class="n">net</span>
<span class="kn">import</span> <span class="nn">janitor</span>

<span class="kn">import</span> <span class="nn">plotly.express</span> <span class="k">as</span> <span class="n">px</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">display</span><span class="p">,</span> <span class="n">HTML</span>

</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Loading dataset
</span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'data/Connections.csv'</span><span class="p">,</span><span class="n">skiprows</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="n">df</span><span class="p">.</span><span class="n">info</span><span class="p">()</span> <span class="c1"># summary info
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 400 entries, 0 to 399
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   First Name     397 non-null    object
 1   Last Name      397 non-null    object
 2   Email Address  7 non-null      object
 3   Company        390 non-null    object
 4   Position       390 non-null    object
 5   Connected On   400 non-null    object
dtypes: object(6)
memory usage: 18.9+ KB
</code></pre></div></div>

<p>At a quick glance, I have 400 connections, a handful of which are missing a name, company or position.</p>

<h2 id="data-cleaning">Data Cleaning</h2>

<p>I will clean the column names, drop the attributes I do not need (names and email addresses), remove rows with a missing company or position, and convert the connection date to a datetime.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">new_df</span> <span class="o">=</span> <span class="p">(</span>
        <span class="n">df</span><span class="p">.</span><span class="n">clean_names</span><span class="p">()</span> <span class="c1"># remove spacing and capitalisation
</span>        <span class="p">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'first_name'</span><span class="p">,</span><span class="s">'last_name'</span><span class="p">,</span><span class="s">'email_address'</span><span class="p">])</span> <span class="c1"># dropped first, last and email
</span>        <span class="p">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">subset</span><span class="o">=</span><span class="p">[</span><span class="s">'company'</span><span class="p">,</span><span class="s">'position'</span><span class="p">])</span> <span class="c1"># remove null values in company and position
</span>        <span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="s">'connected_on'</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'%d %b %Y'</span><span class="p">)</span> <span class="c1"># convert date column to datetime object
</span><span class="p">)</span>
<span class="n">new_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>company</th>
      <th>position</th>
      <th>connected_on</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>InfoCepts</td>
      <td>Talent Acquisition Lead</td>
      <td>2021-11-28</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Yara International</td>
      <td>Associate data engineer</td>
      <td>2021-11-27</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Yara International</td>
      <td>Lead Recruiter, Digital Ag Solutions</td>
      <td>2021-11-25</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Yara International</td>
      <td>Data Scientist</td>
      <td>2021-11-25</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Yara International</td>
      <td>Associate Digital Information Specialist</td>
      <td>2021-11-25</td>
    </tr>
  </tbody>
</table>
</div>
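
<p>For readers without pyjanitor installed, the same cleaning steps can be sketched in plain pandas. This is only an illustration under assumptions: the <code>raw</code> rows and the <code>clean_connections</code> helper are hypothetical and are not part of my notebook.</p>

```python
import pandas as pd

# Hypothetical rows shaped like LinkedIn's Connections.csv export
raw = pd.DataFrame({
    "First Name": ["A", "B", "C"],
    "Last Name": ["X", "Y", "Z"],
    "Email Address": [None, None, None],
    "Company": ["Acme", None, "Globex"],
    "Position": ["Data Analyst", "Engineer", "Data Scientist"],
    "Connected On": ["28 Nov 2021", "27 Nov 2021", "25 Nov 2021"],
})


def clean_connections(df):
    """Plain-pandas version of the cleaning chain (hypothetical helper)."""
    # lower-case the column names and replace spaces with underscores
    out = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # drop the identifying columns that the analysis does not need
    out = out.drop(columns=["first_name", "last_name", "email_address"])
    # remove rows with a missing company or position
    out = out.dropna(subset=["company", "position"])
    # parse dates written like "28 Nov 2021"
    parsed = pd.to_datetime(out["connected_on"], format="%d %b %Y")
    return out.assign(connected_on=parsed)


cleaned = clean_connections(raw)
print(cleaned)
```

<p>The row with a missing company is dropped, and the remaining rows keep only the company, position and parsed date.</p>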

<h2 id="data-exploration">Data Exploration</h2>

<ol>
  <li>Connections at a glance</li>
  <li>New connections over time</li>
  <li>Top 15 companies my connections work at</li>
  <li>Top 15 roles my connections work as</li>
</ol>

<h3 id="connections-at-a-glance">Connections at a glance</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">new_df1</span> <span class="o">=</span> <span class="n">new_df</span><span class="p">[[</span><span class="s">'company'</span><span class="p">,</span><span class="s">'position'</span><span class="p">]].</span><span class="n">copy</span><span class="p">()</span>
<span class="n">new_df1</span><span class="p">[</span><span class="s">'My Network'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'My Network'</span>

<span class="n">px</span><span class="p">.</span><span class="n">treemap</span><span class="p">(</span><span class="n">new_df1</span><span class="p">,</span> <span class="n">path</span><span class="o">=</span><span class="p">[</span><span class="s">'My Network'</span><span class="p">,</span> <span class="s">'company'</span><span class="p">,</span> <span class="s">'position'</span><span class="p">],</span> <span class="n">width</span><span class="o">=</span><span class="mi">1200</span><span class="p">,</span> <span class="n">height</span><span class="o">=</span><span class="mi">1200</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/images/linkedin-network-analysis/plotly-treemap.png" alt="plotly treemap" /></p>

<h3 id="new-connections-over-time">New connections over time</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">daily_connections</span> <span class="o">=</span> <span class="p">(</span><span class="n">new_df</span>
                    <span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'connected_on'</span><span class="p">])</span> <span class="c1"># group by date
</span>                    <span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="c1"># sum up new connections per day
</span>                    <span class="p">.</span><span class="n">plot</span><span class="p">()</span> <span class="c1"># plot line chart
</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/assets/images/linkedin-network-analysis/linkedin-network-analysis_10_0.png" alt="connections line graph" /></p>

<p>Looking at the number of new connections over time since I joined LinkedIn, the bulk of my connections were made early on, between the end of 2019 and the start of 2020.</p>
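
<p>Daily counts can be noisy, so one way to make the trend easier to read is to aggregate by month and take a running total. The sketch below uses a handful of made-up dates in place of <code>new_df['connected_on']</code>; the variable names are assumptions for illustration only.</p>

```python
import pandas as pd

# Hypothetical connection dates standing in for new_df['connected_on']
dates = pd.to_datetime([
    "2019-12-03", "2019-12-15",
    "2020-01-07", "2020-01-20", "2020-01-28",
    "2021-11-25",
])
connections = pd.DataFrame({"connected_on": dates})

# count new connections per calendar month
month = connections["connected_on"].dt.to_period("M")
monthly = connections.groupby(month).size()

# running total shows how the network grew over time
cumulative = monthly.cumsum()

print(monthly)
print(cumulative)
```

<p>Calling <code>.plot()</code> on either series would give a smoother line chart than the daily version.</p>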

<h3 id="top-15-companies-my-connections-work-at">Top 15 companies my connections work at</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">companies_count</span> <span class="o">=</span> <span class="p">(</span><span class="n">new_df</span>
                    <span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'company'</span><span class="p">])</span> <span class="c1"># group by company
</span>                    <span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="c1"># sum up count for each company
</span>                    <span class="p">.</span><span class="n">to_frame</span><span class="p">(</span><span class="s">'size'</span><span class="p">)</span> <span class="c1"># convert to frame
</span>                    <span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'size'</span><span class="p">],</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="c1"># sort by descending order
</span>                    <span class="p">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">companies_count</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">15</span><span class="p">).</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'barh'</span><span class="p">).</span><span class="n">invert_yaxis</span><span class="p">()</span> <span class="c1"># convert to horizontal plot
</span></code></pre></div></div>

<p><img src="/assets/images/linkedin-network-analysis/linkedin-network-analysis_13_0.png" alt="companies bar chart" /></p>

<h3 id="top-15-roles-my-connections-are-working-in">Top 15 roles my connections are working in</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">position_count</span> <span class="o">=</span> <span class="p">(</span><span class="n">new_df</span>
                    <span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'position'</span><span class="p">])</span> <span class="c1"># group by position
</span>                    <span class="p">.</span><span class="n">size</span><span class="p">()</span> <span class="c1"># sum up count for each position
</span>                    <span class="p">.</span><span class="n">to_frame</span><span class="p">(</span><span class="s">'size'</span><span class="p">)</span> <span class="c1"># convert to frame
</span>                    <span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'size'</span><span class="p">],</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="c1"># sort by descending order
</span><span class="p">)</span>
<span class="n">position_count</span><span class="p">.</span><span class="n">head</span><span class="p">(</span><span class="mi">15</span><span class="p">).</span><span class="n">plot</span><span class="p">(</span><span class="n">kind</span><span class="o">=</span><span class="s">'barh'</span><span class="p">).</span><span class="n">invert_yaxis</span><span class="p">()</span> <span class="c1"># convert to horizontal plot
</span></code></pre></div></div>

<p><img src="/assets/images/linkedin-network-analysis/linkedin-network-analysis_15_0.png" alt="positions bar chart" /></p>

<p>The top 3 companies my connections work at are Yara International, Archisen and NTU, which is expected given that I did my undergraduate degree at NTU and worked at Archisen after graduation before joining Yara International.</p>

<p>The most common roles among my connections are Research Assistant, Data Scientist and Software Engineer.</p>

<h2 id="network-analysis">Network Analysis</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">companies_count</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">drop</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">companies_count_reduced</span> <span class="o">=</span> <span class="n">companies_count</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">companies_count</span><span class="p">[</span><span class="s">'size'</span><span class="p">]</span> <span class="o">&gt;=</span><span class="mi">2</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">companies_count_reduced</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(42, 2)
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">position_count</span><span class="p">.</span><span class="n">reset_index</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">position_count_reduced</span> <span class="o">=</span> <span class="n">position_count</span><span class="p">.</span><span class="n">loc</span><span class="p">[</span><span class="n">position_count</span><span class="p">[</span><span class="s">'size'</span><span class="p">]</span> <span class="o">&gt;=</span><span class="mi">2</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="n">position_count_reduced</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(35, 2)
</code></pre></div></div>
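
<p>The counting and filtering steps above can also be written more compactly with <code>value_counts</code>, which counts and sorts in descending order in one call. This is a minimal sketch on made-up positions, not the code I ran; the threshold of 2 mirrors the filter used above.</p>

```python
import pandas as pd

# Hypothetical positions standing in for new_df['position']
positions = pd.Series(
    ["Data Scientist"] * 3 + ["Research Assistant"] * 2 + ["Farm Manager"],
    name="position",
)

# value_counts returns counts sorted from most to least frequent
counts = positions.value_counts()

# keep only the positions held by at least 2 connections
repeated = counts[counts.ge(2)]

print(counts.head(15))
print(repeated.shape)
```

<p>Dropping the single-occurrence entries keeps the network graphs that follow readable.</p>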

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Initialise Graph
</span><span class="n">g1</span> <span class="o">=</span> <span class="n">nx</span><span class="p">.</span><span class="n">Graph</span><span class="p">()</span>
<span class="n">g1</span><span class="p">.</span><span class="n">add_node</span><span class="p">(</span><span class="s">'root'</span><span class="p">)</span> <span class="c1"># initialising myself as central node
</span>
<span class="c1"># iterate through the companies with at least 2 connections
</span><span class="k">for</span> <span class="nb">id</span><span class="p">,</span><span class="n">row</span> <span class="ow">in</span> <span class="n">companies_count_reduced</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span>

    <span class="c1"># store company name and count
</span>    <span class="n">company</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'company'</span><span class="p">]</span>
    <span class="n">count</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'size'</span><span class="p">]</span>
    
    <span class="n">title</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"&lt;b&gt;</span><span class="si">{</span><span class="n">company</span><span class="si">}</span><span class="s">&lt;/b&gt; - </span><span class="si">{</span><span class="n">count</span><span class="si">}</span><span class="s">"</span>
    <span class="c1"># extract the positions my connections hold and store them in a set to prevent duplication
</span>    <span class="n">positions</span> <span class="o">=</span> <span class="nb">set</span><span class="p">([</span><span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">new_df</span><span class="p">[</span><span class="n">company</span> <span class="o">==</span> <span class="n">new_df</span><span class="p">[</span><span class="s">'company'</span><span class="p">]][</span><span class="s">'position'</span><span class="p">]])</span>
    <span class="n">positions</span> <span class="o">=</span> <span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="s">'&lt;li&gt;{}&lt;/li&gt;'</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">positions</span><span class="p">)</span>

    <span class="n">position_list</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"&lt;ul&gt;</span><span class="si">{</span><span class="n">positions</span><span class="si">}</span><span class="s">&lt;/ul&gt;"</span>
    <span class="n">hover_info</span> <span class="o">=</span> <span class="n">title</span> <span class="o">+</span> <span class="n">position_list</span>

    <span class="n">g1</span><span class="p">.</span><span class="n">add_node</span><span class="p">(</span><span class="n">company</span><span class="p">,</span> <span class="n">size</span> <span class="o">=</span> <span class="n">count</span><span class="o">*</span><span class="mi">2</span><span class="p">,</span> <span class="n">title</span> <span class="o">=</span> <span class="n">hover_info</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#3449eb'</span><span class="p">)</span>
    <span class="n">g1</span><span class="p">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s">'root'</span><span class="p">,</span><span class="n">company</span><span class="p">,</span><span class="n">color</span><span class="o">=</span><span class="s">'grey'</span><span class="p">)</span>

<span class="c1"># Generate the graph
</span><span class="n">company_nt</span> <span class="o">=</span> <span class="n">net</span><span class="p">.</span><span class="n">Network</span><span class="p">(</span><span class="n">height</span><span class="o">=</span><span class="s">'700px'</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="s">'700px'</span><span class="p">,</span> <span class="n">bgcolor</span><span class="o">=</span><span class="s">"grey"</span><span class="p">,</span> <span class="n">font_color</span><span class="o">=</span><span class="s">'white'</span><span class="p">,</span><span class="n">notebook</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">company_nt</span><span class="p">.</span><span class="n">from_nx</span><span class="p">(</span><span class="n">g1</span><span class="p">)</span>
<span class="n">company_nt</span><span class="p">.</span><span class="n">hrepulsion</span><span class="p">()</span>

<span class="n">company_nt</span><span class="p">.</span><span class="n">show</span><span class="p">(</span><span class="s">'company_graph.html'</span><span class="p">)</span>
<span class="n">display</span><span class="p">(</span><span class="n">HTML</span><span class="p">(</span><span class="n">filename</span><span class="o">=</span><span class="s">'company_graph.html'</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># initialize graph
</span><span class="n">g2</span> <span class="o">=</span> <span class="n">nx</span><span class="p">.</span><span class="n">Graph</span><span class="p">()</span>
<span class="n">g2</span><span class="p">.</span><span class="n">add_node</span><span class="p">(</span><span class="s">'root'</span><span class="p">)</span> <span class="c1"># initialise myself as central node
</span>
<span class="c1"># use iterrows to iterate through the data frame
</span><span class="k">for</span> <span class="nb">id</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">position_count_reduced</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span>

  <span class="n">count</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">'size'</span><span class="p">])</span>
  <span class="n">position</span><span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">'position'</span><span class="p">]</span>
  
  <span class="n">g2</span><span class="p">.</span><span class="n">add_node</span><span class="p">(</span><span class="n">position</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">count</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'#3449eb'</span><span class="p">,</span> <span class="n">title</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">count</span><span class="p">))</span>
  <span class="n">g2</span><span class="p">.</span><span class="n">add_edge</span><span class="p">(</span><span class="s">'root'</span><span class="p">,</span> <span class="n">position</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'grey'</span><span class="p">)</span>

<span class="c1"># generate the graph
</span><span class="n">position_nt</span> <span class="o">=</span> <span class="n">net</span><span class="p">.</span><span class="n">Network</span><span class="p">(</span><span class="n">height</span><span class="o">=</span><span class="s">'700px'</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="s">'700px'</span><span class="p">,</span> <span class="n">bgcolor</span><span class="o">=</span><span class="s">"black"</span><span class="p">,</span> <span class="n">font_color</span><span class="o">=</span><span class="s">'white'</span><span class="p">,</span> <span class="n">notebook</span> <span class="o">=</span> <span class="bp">True</span><span class="p">)</span>
<span class="n">position_nt</span><span class="p">.</span><span class="n">from_nx</span><span class="p">(</span><span class="n">g2</span><span class="p">)</span>
<span class="n">position_nt</span><span class="p">.</span><span class="n">hrepulsion</span><span class="p">()</span>

<span class="n">position_nt</span><span class="p">.</span><span class="n">show</span><span class="p">(</span><span class="s">'position_graph.html'</span><span class="p">)</span>
<span class="n">display</span><span class="p">(</span><span class="n">HTML</span><span class="p">(</span><span class="n">filename</span><span class="o">=</span><span class="s">'position_graph.html'</span><span class="p">))</span>
</code></pre></div></div>
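
<p>Both graphs above are star graphs: every company or position connects only to the <code>root</code> node. To tie this back to centrality, here is a small sketch using NetworkX's <code>degree_centrality</code> on a toy star graph. The company names and counts are made up for illustration and are not taken from my data.</p>

```python
import networkx as nx

# Hypothetical counts standing in for companies_count_reduced
company_counts = {"Company A": 5, "Company B": 3, "Company C": 2}

g = nx.Graph()
g.add_node("root")  # myself as the central node
for company, count in company_counts.items():
    g.add_node(company, size=count * 2)
    g.add_edge("root", company)

# degree centrality = node degree divided by (number of nodes - 1)
centrality = nx.degree_centrality(g)
most_central = max(centrality, key=centrality.get)

print(most_central, centrality[most_central])
```

<p>In a star graph the root always has the highest degree centrality, so this measure only becomes informative once edges between the outer nodes are added, for example by linking companies that share a position.</p>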

<html>
<head>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/vis/4.16.1/vis.css" type="text/css" />
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/vis/4.16.1/vis-network.min.js"> </script>
<center>
<h1></h1>
</center>

<!-- <link rel="stylesheet" href="../node_modules/vis/dist/vis.min.css" type="text/css" />
<script type="text/javascript" src="../node_modules/vis/dist/vis.js"> </script>-->

<style type="text/css">

        #mynetwork1 {
            width: 700px;
            height: 700px;
            background-color: black;
            border: 1px solid lightgray;
            position: relative;
            float: left;
        }






</style>

</head>

<body>
<div id="mynetwork1"></div>


<script type="text/javascript">

    // initialize global variables.
    var edges;
    var nodes;
    var network; 
    var container;
    var options, data;


    // This method is responsible for drawing the graph, returns the drawn network
    function drawGraph() {
        var container = document.getElementById('mynetwork1');



        // parsing and collecting nodes and edges from the python
        nodes = new vis.DataSet([{"font": {"color": "white"}, "id": "root", "label": "root", "shape": "dot", "size": 10}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Yara International", "label": "Yara International", "shape": "dot", "size": 74, "title": "\u003cb\u003eYara International\u003c/b\u003e - 37\u003cul\u003e\u003cli\u003eProduct Owner\u003c/li\u003e\u003cli\u003eTechnical Lead\u003c/li\u003e\u003cli\u003eAssociate Business Analyst\u003c/li\u003e\u003cli\u003eAssociate Data Analyst\u003c/li\u003e\u003cli\u003eGrowth Product Manager | Growth Analytics Manager\u003c/li\u003e\u003cli\u003eUX Designer\u003c/li\u003e\u003cli\u003eData Analyst\u003c/li\u003e\u003cli\u003eAssociate Recruiter\u003c/li\u003e\u003cli\u003eSenior Localization Engineer\u003c/li\u003e\u003cli\u003eBusiness Support Associate\u003c/li\u003e\u003cli\u003eScrum Master\u003c/li\u003e\u003cli\u003eAssociate data engineer\u003c/li\u003e\u003cli\u003eData Science Intern\u003c/li\u003e\u003cli\u003eAssociate AI/ML Engineer, Data Science\u003c/li\u003e\u003cli\u003eSenior Product Owner | Smallholder Solutions\u003c/li\u003e\u003cli\u003eSenior UX Researcher\u003c/li\u003e\u003cli\u003eTrainee Data Analyst\u003c/li\u003e\u003cli\u003eSenior User Experience Designer\u003c/li\u003e\u003cli\u003eData Analytics Intern\u003c/li\u003e\u003cli\u003eSenior Manager APIs and Integrations Global Delivery Unit -  People Processes and Digitalization \u003c/li\u003e\u003cli\u003eLead Recruiter, Digital Ag Solutions\u003c/li\u003e\u003cli\u003eSenior Data Analyst\u003c/li\u003e\u003cli\u003eAssociate Digital Information Specialist\u003c/li\u003e\u003cli\u003ePesquisadora de Experi\u00eancia do Usu\u00e1rio\u003c/li\u003e\u003cli\u003eTrainee\u003c/li\u003e\u003cli\u003eUser Experience Researcher\u003c/li\u003e\u003cli\u003eHead of Analytics and Data Science\u003c/li\u003e\u003cli\u003eSenior Data Analyst - Product Data \u0026 Analytics\u003c/li\u003e\u003cli\u003eSenior Scrum Master | Release Train 
Engineer\u003c/li\u003e\u003cli\u003eData Scientist\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Archisen", "label": "Archisen", "shape": "dot", "size": 30, "title": "\u003cb\u003eArchisen\u003c/b\u003e - 15\u003cul\u003e\u003cli\u003eSenior Crop Scientist\u003c/li\u003e\u003cli\u003eFarm Manager\u003c/li\u003e\u003cli\u003eData Science Intern\u003c/li\u003e\u003cli\u003eBusiness Development Analyst\u003c/li\u003e\u003cli\u003eProduct Specialist\u003c/li\u003e\u003cli\u003eSoftware Engineer\u003c/li\u003e\u003cli\u003eCo-Founder\u003c/li\u003e\u003cli\u003eintern\u003c/li\u003e\u003cli\u003eHuman Resources Executive\u003c/li\u003e\u003cli\u003eSenior Business Development Executive\u003c/li\u003e\u003cli\u003eRobotics Engineer\u003c/li\u003e\u003cli\u003eCrop Scientist\u003c/li\u003e\u003cli\u003eSpecial projects lead\u003c/li\u003e\u003cli\u003eFarm Engineer\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Nanyang Technological University", "label": "Nanyang Technological University", "shape": "dot", "size": 28, "title": "\u003cb\u003eNanyang Technological University\u003c/b\u003e - 14\u003cul\u003e\u003cli\u003eSenior Executive\u003c/li\u003e\u003cli\u003eLecturer \u003c/li\u003e\u003cli\u003eAssistant Dean (Development)\u003c/li\u003e\u003cli\u003ePhD Candidate - Asian School of the Environment\u003c/li\u003e\u003cli\u003eUndergraduate Research Experience on Campus \u003c/li\u003e\u003cli\u003eManager, Career \u0026 Attachment Office \u003c/li\u003e\u003cli\u003eLecturer and Undergraduate Programme Coordinator, Asian School of the Environment\u003c/li\u003e\u003cli\u003eResearch Assistant\u003c/li\u003e\u003cli\u003eSenior Assistant Director\u003c/li\u003e\u003cli\u003eProfessor\u003c/li\u003e\u003cli\u003eSenior Year Thesis - Terrestrial paleoclimatology with triple oxygen isotope of speleothems\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, 
"id": "DBS Bank", "label": "DBS Bank", "shape": "dot", "size": 12, "title": "\u003cb\u003eDBS Bank\u003c/b\u003e - 6\u003cul\u003e\u003cli\u003eSpecialist, Graduate Associate (Software Developer) - SEED, Technology \u0026 Operations\u003c/li\u003e\u003cli\u003eSoftware Engineer\u003c/li\u003e\u003cli\u003eCloud Engineer\u003c/li\u003e\u003cli\u003eAnalyst - Group Sustainability\u003c/li\u003e\u003cli\u003eGraduate Associate (SEED)\u003c/li\u003e\u003cli\u003eApp Developer\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Shopee", "label": "Shopee", "shape": "dot", "size": 10, "title": "\u003cb\u003eShopee\u003c/b\u003e - 5\u003cul\u003e\u003cli\u003eSenior Associate | People Team\u003c/li\u003e\u003cli\u003eContent Specialist, Regional Operations\u003c/li\u003e\u003cli\u003eBusiness Development, Retail\u003c/li\u003e\u003cli\u003eShopeeFood - Campaign Planning\u003c/li\u003e\u003cli\u003eAssociate, Regional Operations | Logistics\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Earth Observatory of Singapore", "label": "Earth Observatory of Singapore", "shape": "dot", "size": 10, "title": "\u003cb\u003eEarth Observatory of Singapore\u003c/b\u003e - 5\u003cul\u003e\u003cli\u003ePHD Student\u003c/li\u003e\u003cli\u003ePostdoctoral Researcher\u003c/li\u003e\u003cli\u003eResearch Assistant\u003c/li\u003e\u003cli\u003eDirector of Research and Strategy\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "National Parks Board", "label": "National Parks Board", "shape": "dot", "size": 8, "title": "\u003cb\u003eNational Parks Board\u003c/b\u003e - 4\u003cul\u003e\u003cli\u003eWildlife Management Research Intern\u003c/li\u003e\u003cli\u003eManager (Parks)\u003c/li\u003e\u003cli\u003eManager/Skyrise Greenery\u003c/li\u003e\u003cli\u003eConservation Manager\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Enterprise 
Singapore", "label": "Enterprise Singapore", "shape": "dot", "size": 8, "title": "\u003cb\u003eEnterprise Singapore\u003c/b\u003e - 4\u003cul\u003e\u003cli\u003eMarine \u0026 Offshore Engineering Services Associate\u003c/li\u003e\u003cli\u003eManagement Associate\u003c/li\u003e\u003cli\u003eDeputy Director (Startup Development)\u003c/li\u003e\u003cli\u003eDevelopment Partner, Circular Economy \u0026 Sustainability\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Wildlife Reserves Singapore (WRS)", "label": "Wildlife Reserves Singapore (WRS)", "shape": "dot", "size": 8, "title": "\u003cb\u003eWildlife Reserves Singapore (WRS)\u003c/b\u003e - 4\u003cul\u003e\u003cli\u003eEducation Facilitator\u003c/li\u003e\u003cli\u003eEducational Camp Facilitator \u003c/li\u003e\u003cli\u003eManager\u003c/li\u003e\u003cli\u003eSenior Executive, Education\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Ministry of Education, Singapore (MOE)", "label": "Ministry of Education, Singapore (MOE)", "shape": "dot", "size": 8, "title": "\u003cb\u003eMinistry of Education, Singapore (MOE)\u003c/b\u003e - 4\u003cul\u003e\u003cli\u003eGeneral Education Officer\u003c/li\u003e\u003cli\u003ePhysical Education Teacher\u003c/li\u003e\u003cli\u003eTeacher\u003c/li\u003e\u003cli\u003eEnglish and Mathematics Teacher\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Mott MacDonald", "label": "Mott MacDonald", "shape": "dot", "size": 8, "title": "\u003cb\u003eMott MacDonald\u003c/b\u003e - 4\u003cul\u003e\u003cli\u003eGraduate Engineering Geologist\u003c/li\u003e\u003cli\u003eGraduate Environmental Consultant \u003c/li\u003e\u003cli\u003eGeotechnical Engineer Intern\u003c/li\u003e\u003cli\u003eGraduate engineering geologist\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Monetary Authority of Singapore (MAS)", "label": "Monetary Authority of 
Singapore (MAS)", "shape": "dot", "size": 8, "title": "\u003cb\u003eMonetary Authority of Singapore (MAS)\u003c/b\u003e - 4\u003cul\u003e\u003cli\u003eAssociate\u003c/li\u003e\u003cli\u003eUI/UX Designer - Transformation Division\u003c/li\u003e\u003cli\u003eHFT Algorithmic Trading Project Intern\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Asian School of the Environment", "label": "Asian School of the Environment", "shape": "dot", "size": 6, "title": "\u003cb\u003eAsian School of the Environment\u003c/b\u003e - 3\u003cul\u003e\u003cli\u003eGraduate Student\u003c/li\u003e\u003cli\u003eAlumni Association Liaison Officer, ASE Club\u003c/li\u003e\u003cli\u003eStudent Research Assistant\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Mandai Wildlife Group", "label": "Mandai Wildlife Group", "shape": "dot", "size": 6, "title": "\u003cb\u003eMandai Wildlife Group\u003c/b\u003e - 3\u003cul\u003e\u003cli\u003eAssistant Manager\u003c/li\u003e\u003cli\u003eWildlife Experience Curator, Sales and Experience Development\u003c/li\u003e\u003cli\u003eManager\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Temasek", "label": "Temasek", "shape": "dot", "size": 6, "title": "\u003cb\u003eTemasek\u003c/b\u003e - 3\u003cul\u003e\u003cli\u003eAssociate, Impact Investing\u003c/li\u003e\u003cli\u003eInvestment Services Associate\u003c/li\u003e\u003cli\u003eNature-based Solutions Associate\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Nanyang Technological University, Singapore", "label": "Nanyang Technological University, Singapore", "shape": "dot", "size": 6, "title": "\u003cb\u003eNanyang Technological University, Singapore\u003c/b\u003e - 3\u003cul\u003e\u003cli\u003eAssistant Professor\u003c/li\u003e\u003cli\u003eCareer Consultant\u003c/li\u003e\u003cli\u003eAssociate Professor\u003c/li\u003e\u003c/ul\u003e"}, {"color": 
"#3449eb", "font": {"color": "white"}, "id": "National University of Singapore", "label": "National University of Singapore", "shape": "dot", "size": 6, "title": "\u003cb\u003eNational University of Singapore\u003c/b\u003e - 3\u003cul\u003e\u003cli\u003eResearch Assistant\u003c/li\u003e\u003cli\u003eAssociate Researcher\u003c/li\u003e\u003cli\u003ePhD Candidate\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Citi", "label": "Citi", "shape": "dot", "size": 6, "title": "\u003cb\u003eCiti\u003c/b\u003e - 3\u003cul\u003e\u003cli\u003eInstitutional Investor Sales\u003c/li\u003e\u003cli\u003eManagement Associate, Global Consumer Banking\u003c/li\u003e\u003cli\u003eInvestments Product Manager\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "JTC Corporation", "label": "JTC Corporation", "shape": "dot", "size": 4, "title": "\u003cb\u003eJTC Corporation\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eAssistant Manager (Land Resource)\u003c/li\u003e\u003cli\u003eAssistant Manager (Land Resource Management)\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "SCELSE", "label": "SCELSE", "shape": "dot", "size": 4, "title": "\u003cb\u003eSCELSE\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eAssistant Manager, Science Communications\u003c/li\u003e\u003cli\u003ePHD Candidate\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "PUB, Singapore\u0027s National Water Agency", "label": "PUB, Singapore\u0027s National Water Agency", "shape": "dot", "size": 4, "title": "\u003cb\u003ePUB, Singapore\u0027s National Water Agency\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003ePlanner (Coastal Protection)\u003c/li\u003e\u003cli\u003eEngineer\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Conservation International", "label": "Conservation International", "shape": "dot", "size": 4, "title": 
"\u003cb\u003eConservation International\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eBushfire Mapping Intern\u003c/li\u003e\u003cli\u003eCommunications Intern\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "AbbVie", "label": "AbbVie", "shape": "dot", "size": 4, "title": "\u003cb\u003eAbbVie\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eQuality Control (API)\u003c/li\u003e\u003cli\u003eAssociate Scientist\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "UBS", "label": "UBS", "shape": "dot", "size": 4, "title": "\u003cb\u003eUBS\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eESG Analyst, APAC Sustainable Finance Office\u003c/li\u003e\u003cli\u003eWM - Investment Performance and Risk Analytics Reporting Specialist\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "National Institute of Education, Singapore", "label": "National Institute of Education, Singapore", "shape": "dot", "size": 4, "title": "\u003cb\u003eNational Institute of Education, Singapore\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eMaster\u0027s Candidate\u003c/li\u003e\u003cli\u003eResearch Assistant\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "ThoughtWorks", "label": "ThoughtWorks", "shape": "dot", "size": 4, "title": "\u003cb\u003eThoughtWorks\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eLead Consultant\u003c/li\u003e\u003cli\u003eProduct Designer\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Kalco Law LLC", "label": "Kalco Law LLC", "shape": "dot", "size": 4, "title": "\u003cb\u003eKalco Law LLC\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eLegal Associate\u003c/li\u003e\u003cli\u003eLegal Intern\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "ERCE", "label": "ERCE", "shape": "dot", "size": 4, "title": "\u003cb\u003eERCE\u003c/b\u003e - 
2\u003cul\u003e\u003cli\u003eGraduate Geoscientist\u003c/li\u003e\u003cli\u003eGeoscientist\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "ERM: Environmental Resources Management", "label": "ERM: Environmental Resources Management", "shape": "dot", "size": 4, "title": "\u003cb\u003eERM: Environmental Resources Management\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eEnvironmental Analyst (CLE)\u003c/li\u003e\u003cli\u003eEnvironmental Consultant\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Singapore Food Agency", "label": "Singapore Food Agency", "shape": "dot", "size": 4, "title": "\u003cb\u003eSingapore Food Agency\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eManager\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Security \u0026 Risk Solutions Pte Ltd", "label": "Security \u0026 Risk Solutions Pte Ltd", "shape": "dot", "size": 4, "title": "\u003cb\u003eSecurity \u0026 Risk Solutions Pte Ltd\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eCrisis Response Analyst - APAC | Facebook\u003c/li\u003e\u003cli\u003eCrisis Response Associate Operations Lead APAC @ Facebook\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Udders Ice Cream", "label": "Udders Ice Cream", "shape": "dot", "size": 4, "title": "\u003cb\u003eUdders Ice Cream\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eBusiness Development Director\u003c/li\u003e\u003cli\u003eBusiness Development \u0026 Events \u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Schneider Electric", "label": "Schneider Electric", "shape": "dot", "size": 4, "title": "\u003cb\u003eSchneider Electric\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eProject Engineer\u003c/li\u003e\u003cli\u003eSchneider Graduate Programme(SGP) Associate\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Esri Singapore", 
"label": "Esri Singapore", "shape": "dot", "size": 4, "title": "\u003cb\u003eEsri Singapore\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eSolution Engineer\u003c/li\u003e\u003cli\u003eSolutions Engineer\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "TEMBUSU Asia Consulting", "label": "TEMBUSU Asia Consulting", "shape": "dot", "size": 4, "title": "\u003cb\u003eTEMBUSU Asia Consulting\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eEnvironmental Consultant\u003c/li\u003e\u003cli\u003eConsultant\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Ministry of Sustainability and the Environment, Singapore", "label": "Ministry of Sustainability and the Environment, Singapore", "shape": "dot", "size": 4, "title": "\u003cb\u003eMinistry of Sustainability and the Environment, Singapore\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eSenior Executive\u003c/li\u003e\u003cli\u003eSenior Exective\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "TikTok", "label": "TikTok", "shape": "dot", "size": 4, "title": "\u003cb\u003eTikTok\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eData Analyst\u003c/li\u003e\u003cli\u003eLivestream Campaigns \u0026 Creator Community\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Ministry of Trade and Industry (Singapore)", "label": "Ministry of Trade and Industry (Singapore)", "shape": "dot", "size": 4, "title": "\u003cb\u003eMinistry of Trade and Industry (Singapore)\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eAssistant Manager\u003c/li\u003e\u003cli\u003eAssistant Director\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Micron Technology", "label": "Micron Technology", "shape": "dot", "size": 4, "title": "\u003cb\u003eMicron Technology\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eSAP Business Process Analyst\u003c/li\u003e\u003cli\u003eProgram Management 
Office (PMO) Engineer\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Arup", "label": "Arup", "shape": "dot", "size": 4, "title": "\u003cb\u003eArup\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eGeographic Information Systems Consultant\u003c/li\u003e\u003cli\u003eEngineering Geologist\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "OCBC Bank", "label": "OCBC Bank", "shape": "dot", "size": 4, "title": "\u003cb\u003eOCBC Bank\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eAssistant Manager, OCBC Graduate Talent Programme\u003c/li\u003e\u003cli\u003eRisk Policy Executive\u003c/li\u003e\u003c/ul\u003e"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "foodpanda", "label": "foodpanda", "shape": "dot", "size": 4, "title": "\u003cb\u003efoodpanda\u003c/b\u003e - 2\u003cul\u003e\u003cli\u003eData Engineer\u003c/li\u003e\u003cli\u003eAssociate Category Lead - Fresh\u003c/li\u003e\u003c/ul\u003e"}]);
        edges = new vis.DataSet([{"color": "grey", "from": "root", "to": "Yara International", "weight": 1}, {"color": "grey", "from": "root", "to": "Archisen", "weight": 1}, {"color": "grey", "from": "root", "to": "Nanyang Technological University", "weight": 1}, {"color": "grey", "from": "root", "to": "DBS Bank", "weight": 1}, {"color": "grey", "from": "root", "to": "Shopee", "weight": 1}, {"color": "grey", "from": "root", "to": "Earth Observatory of Singapore", "weight": 1}, {"color": "grey", "from": "root", "to": "National Parks Board", "weight": 1}, {"color": "grey", "from": "root", "to": "Enterprise Singapore", "weight": 1}, {"color": "grey", "from": "root", "to": "Wildlife Reserves Singapore (WRS)", "weight": 1}, {"color": "grey", "from": "root", "to": "Ministry of Education, Singapore (MOE)", "weight": 1}, {"color": "grey", "from": "root", "to": "Mott MacDonald", "weight": 1}, {"color": "grey", "from": "root", "to": "Monetary Authority of Singapore (MAS)", "weight": 1}, {"color": "grey", "from": "root", "to": "Asian School of the Environment", "weight": 1}, {"color": "grey", "from": "root", "to": "Mandai Wildlife Group", "weight": 1}, {"color": "grey", "from": "root", "to": "Temasek", "weight": 1}, {"color": "grey", "from": "root", "to": "Nanyang Technological University, Singapore", "weight": 1}, {"color": "grey", "from": "root", "to": "National University of Singapore", "weight": 1}, {"color": "grey", "from": "root", "to": "Citi", "weight": 1}, {"color": "grey", "from": "root", "to": "JTC Corporation", "weight": 1}, {"color": "grey", "from": "root", "to": "SCELSE", "weight": 1}, {"color": "grey", "from": "root", "to": "PUB, Singapore\u0027s National Water Agency", "weight": 1}, {"color": "grey", "from": "root", "to": "Conservation International", "weight": 1}, {"color": "grey", "from": "root", "to": "AbbVie", "weight": 1}, {"color": "grey", "from": "root", "to": "UBS", "weight": 1}, {"color": "grey", "from": "root", "to": "National Institute of Education, 
Singapore", "weight": 1}, {"color": "grey", "from": "root", "to": "ThoughtWorks", "weight": 1}, {"color": "grey", "from": "root", "to": "Kalco Law LLC", "weight": 1}, {"color": "grey", "from": "root", "to": "ERCE", "weight": 1}, {"color": "grey", "from": "root", "to": "ERM: Environmental Resources Management", "weight": 1}, {"color": "grey", "from": "root", "to": "Singapore Food Agency", "weight": 1}, {"color": "grey", "from": "root", "to": "Security \u0026 Risk Solutions Pte Ltd", "weight": 1}, {"color": "grey", "from": "root", "to": "Udders Ice Cream", "weight": 1}, {"color": "grey", "from": "root", "to": "Schneider Electric", "weight": 1}, {"color": "grey", "from": "root", "to": "Esri Singapore", "weight": 1}, {"color": "grey", "from": "root", "to": "TEMBUSU Asia Consulting", "weight": 1}, {"color": "grey", "from": "root", "to": "Ministry of Sustainability and the Environment, Singapore", "weight": 1}, {"color": "grey", "from": "root", "to": "TikTok", "weight": 1}, {"color": "grey", "from": "root", "to": "Ministry of Trade and Industry (Singapore)", "weight": 1}, {"color": "grey", "from": "root", "to": "Micron Technology", "weight": 1}, {"color": "grey", "from": "root", "to": "Arup", "weight": 1}, {"color": "grey", "from": "root", "to": "OCBC Bank", "weight": 1}, {"color": "grey", "from": "root", "to": "foodpanda", "weight": 1}]);

        // adding nodes and edges to the graph
        data = {nodes: nodes, edges: edges};

        var options = {
    "configure": {
        "enabled": false
    },
    "edges": {
        "color": {
            "inherit": true
        },
        "smooth": {
            "enabled": false,
            "type": "continuous"
        }
    },
    "interaction": {
        "dragNodes": true,
        "hideEdgesOnDrag": false,
        "hideNodesOnDrag": false
    },
    "physics": {
        "enabled": true,
        "hierarchicalRepulsion": {
            "centralGravity": 0.0,
            "damping": 0.09,
            "nodeDistance": 120,
            "springConstant": 0.01,
            "springLength": 100
        },
        "solver": "hierarchicalRepulsion",
        "stabilization": {
            "enabled": true,
            "fit": true,
            "iterations": 1000,
            "onlyDynamicEdges": false,
            "updateInterval": 50
        }
    }
};





        network = new vis.Network(container, data, options);






        return network;

    }

    drawGraph();

</script>
</body>
</html>

<html>
<head>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/vis/4.16.1/vis.css" type="text/css" />
<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/vis/4.16.1/vis-network.min.js"> </script>


<style type="text/css">

        #mynetwork {
            width: 700px;
            height: 700px;
            background-color: black;
            border: 1px solid lightgray;
            position: relative;
            float: left;
        }






</style>

</head>

<body>
<div id="mynetwork"></div>


<script type="text/javascript">

    // initialize global variables.
    var edges;
    var nodes;
    var network; 
    var container;
    var options, data;


    // This method is responsible for drawing the graph, returns the drawn network
    function drawGraph() {
        var container = document.getElementById('mynetwork');



        // parsing and collecting nodes and edges from the python
        nodes = new vis.DataSet([{"font": {"color": "white"}, "id": "root", "label": "root", "shape": "dot", "size": 10}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Research Assistant", "label": "Research Assistant", "shape": "dot", "size": 8, "title": "8"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Data Scientist", "label": "Data Scientist", "shape": "dot", "size": 6, "title": "6"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Software Engineer", "label": "Software Engineer", "shape": "dot", "size": 5, "title": "5"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Consultant", "label": "Consultant", "shape": "dot", "size": 5, "title": "5"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Data Analyst", "label": "Data Analyst", "shape": "dot", "size": 4, "title": "4"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Senior Executive", "label": "Senior Executive", "shape": "dot", "size": 4, "title": "4"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Manager", "label": "Manager", "shape": "dot", "size": 4, "title": "4"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Engineer", "label": "Engineer", "shape": "dot", "size": 3, "title": "3"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Co-Founder", "label": "Co-Founder", "shape": "dot", "size": 3, "title": "3"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Senior Consultant", "label": "Senior Consultant", "shape": "dot", "size": 3, "title": "3"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Data Science Intern", "label": "Data Science Intern", "shape": "dot", "size": 3, "title": "3"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Environmental Consultant", "label": "Environmental Consultant", "shape": "dot", "size": 3, "title": "3"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Graduate Student", "label": "Graduate Student", "shape": "dot", "size": 3, "title": "3"}, {"color": 
"#3449eb", "font": {"color": "white"}, "id": "PHD Candidate", "label": "PHD Candidate", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Chief Executive Officer", "label": "Chief Executive Officer", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Associate Data Analyst", "label": "Associate Data Analyst", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "PHD Student", "label": "PHD Student", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Sustainability Analyst", "label": "Sustainability Analyst", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Postdoctoral Researcher", "label": "Postdoctoral Researcher", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "PhD Candidate", "label": "PhD Candidate", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Catastrophe Risk Analyst", "label": "Catastrophe Risk Analyst", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Business Development Executive", "label": "Business Development Executive", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "President", "label": "President", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Product Designer", "label": "Product Designer", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Project Engineer", "label": "Project Engineer", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Founder", "label": "Founder", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Business 
Development Manager", "label": "Business Development Manager", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Associate", "label": "Associate", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Business Development Associate", "label": "Business Development Associate", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Data Engineer", "label": "Data Engineer", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Senior Data Analyst", "label": "Senior Data Analyst", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Management Associate", "label": "Management Associate", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Assistant Manager", "label": "Assistant Manager", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "Intern", "label": "Intern", "shape": "dot", "size": 2, "title": "2"}, {"color": "#3449eb", "font": {"color": "white"}, "id": "UX Designer", "label": "UX Designer", "shape": "dot", "size": 2, "title": "2"}]);
        edges = new vis.DataSet([{"color": "grey", "from": "root", "to": "Research Assistant", "weight": 1}, {"color": "grey", "from": "root", "to": "Data Scientist", "weight": 1}, {"color": "grey", "from": "root", "to": "Software Engineer", "weight": 1}, {"color": "grey", "from": "root", "to": "Consultant", "weight": 1}, {"color": "grey", "from": "root", "to": "Data Analyst", "weight": 1}, {"color": "grey", "from": "root", "to": "Senior Executive", "weight": 1}, {"color": "grey", "from": "root", "to": "Manager", "weight": 1}, {"color": "grey", "from": "root", "to": "Engineer", "weight": 1}, {"color": "grey", "from": "root", "to": "Co-Founder", "weight": 1}, {"color": "grey", "from": "root", "to": "Senior Consultant", "weight": 1}, {"color": "grey", "from": "root", "to": "Data Science Intern", "weight": 1}, {"color": "grey", "from": "root", "to": "Environmental Consultant", "weight": 1}, {"color": "grey", "from": "root", "to": "Graduate Student", "weight": 1}, {"color": "grey", "from": "root", "to": "PHD Candidate", "weight": 1}, {"color": "grey", "from": "root", "to": "Chief Executive Officer", "weight": 1}, {"color": "grey", "from": "root", "to": "Associate Data Analyst", "weight": 1}, {"color": "grey", "from": "root", "to": "PHD Student", "weight": 1}, {"color": "grey", "from": "root", "to": "Sustainability Analyst", "weight": 1}, {"color": "grey", "from": "root", "to": "Postdoctoral Researcher", "weight": 1}, {"color": "grey", "from": "root", "to": "PhD Candidate", "weight": 1}, {"color": "grey", "from": "root", "to": "Catastrophe Risk Analyst", "weight": 1}, {"color": "grey", "from": "root", "to": "Business Development Executive", "weight": 1}, {"color": "grey", "from": "root", "to": "President", "weight": 1}, {"color": "grey", "from": "root", "to": "Product Designer", "weight": 1}, {"color": "grey", "from": "root", "to": "Project Engineer", "weight": 1}, {"color": "grey", "from": "root", "to": "Founder", "weight": 1}, {"color": "grey", "from": "root", "to": 
"Business Development Manager", "weight": 1}, {"color": "grey", "from": "root", "to": "Associate", "weight": 1}, {"color": "grey", "from": "root", "to": "Business Development Associate", "weight": 1}, {"color": "grey", "from": "root", "to": "Data Engineer", "weight": 1}, {"color": "grey", "from": "root", "to": "Senior Data Analyst", "weight": 1}, {"color": "grey", "from": "root", "to": "Management Associate", "weight": 1}, {"color": "grey", "from": "root", "to": "Assistant Manager", "weight": 1}, {"color": "grey", "from": "root", "to": "Intern", "weight": 1}, {"color": "grey", "from": "root", "to": "UX Designer", "weight": 1}]);

        // adding nodes and edges to the graph
        data = {nodes: nodes, edges: edges};

        var options = {
    "configure": {
        "enabled": false
    },
    "edges": {
        "color": {
            "inherit": true
        },
        "smooth": {
            "enabled": false,
            "type": "continuous"
        }
    },
    "interaction": {
        "dragNodes": true,
        "hideEdgesOnDrag": false,
        "hideNodesOnDrag": false
    },
    "physics": {
        "enabled": true,
        "hierarchicalRepulsion": {
            "centralGravity": 0.0,
            "damping": 0.09,
            "nodeDistance": 120,
            "springConstant": 0.01,
            "springLength": 100
        },
        "solver": "hierarchicalRepulsion",
        "stabilization": {
            "enabled": true,
            "fit": true,
            "iterations": 1000,
            "onlyDynamicEdges": false,
            "updateInterval": 50
        }
    }
};





        network = new vis.Network(container, data, options);






        return network;

    }

    drawGraph();

</script>
</body>
</html>]]></content><author><name>Blog Author</name></author><category term="Data Analytics" /><category term="Networkx" /><category term="LinkedIn" /><summary type="html"><![CDATA[Using networkx and plotly to visualise my LinkedIn network.]]></summary></entry><entry><title type="html">Rule-based Sentiment Analysis on Syfe, Stashaway and Endowus</title><link href="http://www.layonsan.com/multiple-rule-based-sentiment-analysis/" rel="alternate" type="text/html" title="Rule-based Sentiment Analysis on Syfe, Stashaway and Endowus" /><published>2021-09-24T00:00:00+00:00</published><updated>2021-09-24T00:00:00+00:00</updated><id>http://www.layonsan.com/multiple-rule-based-sentiment-analysis</id><content type="html" xml:base="http://www.layonsan.com/multiple-rule-based-sentiment-analysis/"><![CDATA[<p>Are all app reviews created equal? I put three investment platforms (Syfe, StashAway, and Endowus) under the microscope using three lexicon-based sentiment tools: TextBlob, VADER, and SentiWordNet. Each tool scores the same reviews a little differently, so comparing them gives a fuller picture of what users love (or don’t) and of the overall tone of the feedback.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">warnings</span>
<span class="n">warnings</span><span class="p">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s">'ignore'</span><span class="p">)</span>

</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># import file
</span><span class="n">app_reviews</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'app_reviews.csv'</span><span class="p">)</span>
<span class="n">app_reviews</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>app_name</th>
      <th>content</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Syfe</td>
      <td>1. The portfolio “card user interface” can be ...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Syfe</td>
      <td>This hybrid app is quite buggy compared Stasha...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Syfe</td>
      <td>The app and website is just a bunch of fake li...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Syfe</td>
      <td>The app looks fantastic and it’s so fresh with...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Syfe</td>
      <td>Hi there,\n\nThe app checks for latest version...</td>
    </tr>
  </tbody>
</table>
</div>

<h1 id="1-data-preprocessing">1. Data Preprocessing</h1>
<p>Data preprocessing steps:</p>

<p>a. Cleaning the text<br />
b. Tokenization <br />
c. Enrichment – POS tagging <br />
d. Stopwords removal<br />
e. Obtaining the stem words</p>

<h2 id="1a-cleaning-the-text">1a. Cleaning the Text</h2>

<p>Remove special characters and numbers from the review text with a regular expression, keeping only the letters.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Define a function to clean the text
</span><span class="k">def</span> <span class="nf">clean</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="c1"># Replace every run of non-letter characters (digits, punctuation, symbols) with a single space
</span>    <span class="n">text</span> <span class="o">=</span> <span class="n">re</span><span class="p">.</span><span class="n">sub</span><span class="p">(</span><span class="s">'[^A-Za-z]+'</span><span class="p">,</span> <span class="s">' '</span><span class="p">,</span> <span class="n">text</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">text</span>

<span class="c1"># Cleaning the text in the review column
</span><span class="n">app_reviews</span><span class="p">[</span><span class="s">'cleaned_reviews'</span><span class="p">]</span> <span class="o">=</span> <span class="n">app_reviews</span><span class="p">[</span><span class="s">'content'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">clean</span><span class="p">)</span>
<span class="n">app_reviews</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>app_name</th>
      <th>content</th>
      <th>cleaned_reviews</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Syfe</td>
      <td>1. The portfolio “card user interface” can be ...</td>
      <td>The portfolio card user interface can be inco...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Syfe</td>
      <td>This hybrid app is quite buggy compared Stasha...</td>
      <td>This hybrid app is quite buggy compared Stasha...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Syfe</td>
      <td>The app and website is just a bunch of fake li...</td>
      <td>The app and website is just a bunch of fake li...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Syfe</td>
      <td>The app looks fantastic and it’s so fresh with...</td>
      <td>The app looks fantastic and it s so fresh with...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Syfe</td>
      <td>Hi there,\n\nThe app checks for latest version...</td>
      <td>Hi there The app checks for latest version dur...</td>
    </tr>
  </tbody>
</table>
</div>
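
<p>To see what the pattern does on its own, here is a minimal sketch using Python's built-in <code>re</code> module. The sample string is made up for illustration and is not taken from the dataset.</p>

```python
import re

def clean(text):
    # Replace every run of non-letter characters with a single space
    return re.sub('[^A-Za-z]+', ' ', text)

sample = "It's 100% great, v2.0!"  # hypothetical review text
cleaned = clean(sample)
print(repr(cleaned))
```

<p>Apostrophes count as special characters too, so a contraction such as "It's" is split into "It s". The same effect is visible in row 3 of the table above.</p>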

<h2 id="1b-tokenisation">1b. Tokenisation</h2>

<p>Use the nltk <code>word_tokenize()</code> function to split each review into word-level tokens.</p>

<h2 id="1c-enrichment--pos-tagging">1c. Enrichment – POS tagging</h2>

<p>Use the nltk <code>pos_tag</code> function to perform Parts of Speech (POS) tagging, which converts each token into a tuple of the form (word, tag). POS tagging preserves the context of each word and is required for lemmatization.</p>

<h2 id="1d-stopwords-removal">1d. Stopwords removal</h2>
<p>Stopwords are common words, such as "the", "is" and "and", that carry very little useful information, so we remove them as part of text preprocessing. nltk ships stopword lists for a number of languages, including English.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nltk</span>
<span class="kn">from</span> <span class="nn">nltk.tokenize</span> <span class="kn">import</span> <span class="n">word_tokenize</span>
<span class="c1"># Download punkt resource if unavailable
# nltk.download('punkt') 
</span>
<span class="kn">from</span> <span class="nn">nltk.tag</span> <span class="kn">import</span> <span class="n">pos_tag</span>
<span class="c1"># Download averaged_perceptron_tagger resource if unavailable
# nltk.download('averaged_perceptron_tagger')
</span>
<span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">stopwords</span>
<span class="c1"># nltk.download('stopwords')
</span><span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">wordnet</span>
<span class="c1"># Download wordnet resource if unavailable
# nltk.download('wordnet')
</span></code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## POS tagger dictionary
# To return an accurate lemma, WordNetLemmatizer needs POS tags in wordnet form: 'n', 'a', 'v', 'r'.
# pos_tag, however, returns Penn Treebank tags such as 'NN', 'JJ', 'VBZ' and 'RB'.
# pos_dict maps the first letter of each Penn Treebank tag to the matching wordnet tag:
# tags starting with J map to wordnet.ADJ, tags starting with R map to wordnet.ADV, and so on.
# Only nouns, adjectives, adverbs and verbs are of interest; any other tag is mapped to None.
</span><span class="n">pos_dict</span> <span class="o">=</span> <span class="p">{</span><span class="s">'J'</span><span class="p">:</span><span class="n">wordnet</span><span class="p">.</span><span class="n">ADJ</span><span class="p">,</span> <span class="s">'V'</span><span class="p">:</span><span class="n">wordnet</span><span class="p">.</span><span class="n">VERB</span><span class="p">,</span> <span class="s">'N'</span><span class="p">:</span><span class="n">wordnet</span><span class="p">.</span><span class="n">NOUN</span><span class="p">,</span> <span class="s">'R'</span><span class="p">:</span><span class="n">wordnet</span><span class="p">.</span><span class="n">ADV</span><span class="p">}</span>

<span class="c1"># Build the stopword set once instead of rebuilding it for every token
</span><span class="n">stop_words</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">stopwords</span><span class="p">.</span><span class="n">words</span><span class="p">(</span><span class="s">'english'</span><span class="p">))</span>

<span class="k">def</span> <span class="nf">token_stop_pos</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="n">tags</span> <span class="o">=</span> <span class="n">pos_tag</span><span class="p">(</span><span class="n">word_tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">))</span> <span class="c1"># tokenise the review and POS-tag the tokens
</span>    <span class="n">newlist</span> <span class="o">=</span> <span class="p">[]</span> <span class="c1"># collects (word, wordnet tag) tuples
</span>    <span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">tag</span> <span class="ow">in</span> <span class="n">tags</span><span class="p">:</span> <span class="c1"># iterate through the (word, POS tag) tuples
</span>        <span class="k">if</span> <span class="n">word</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stop_words</span><span class="p">:</span> <span class="c1"># skip stopwords
</span>            <span class="n">newlist</span><span class="p">.</span><span class="n">append</span><span class="p">((</span><span class="n">word</span><span class="p">,</span> <span class="n">pos_dict</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">tag</span><span class="p">[</span><span class="mi">0</span><span class="p">])))</span> <span class="c1"># map the Penn Treebank tag to its wordnet form via pos_dict
</span>    <span class="k">return</span> <span class="n">newlist</span>

<span class="n">app_reviews</span><span class="p">[</span><span class="s">'pos_tagged'</span><span class="p">]</span> <span class="o">=</span> <span class="n">app_reviews</span><span class="p">[</span><span class="s">'cleaned_reviews'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">token_stop_pos</span><span class="p">)</span> <span class="c1"># apply token_stop_pos function to the reviews
</span><span class="n">app_reviews</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>app_name</th>
      <th>content</th>
      <th>cleaned_reviews</th>
      <th>pos_tagged</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Syfe</td>
      <td>1. The portfolio “card user interface” can be ...</td>
      <td>The portfolio card user interface can be inco...</td>
      <td>[(portfolio, n), (card, n), (user, None), (int...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Syfe</td>
      <td>This hybrid app is quite buggy compared Stasha...</td>
      <td>This hybrid app is quite buggy compared Stasha...</td>
      <td>[(hybrid, a), (app, n), (quite, r), (buggy, a)...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Syfe</td>
      <td>The app and website is just a bunch of fake li...</td>
      <td>The app and website is just a bunch of fake li...</td>
      <td>[(app, n), (website, n), (bunch, n), (fake, a)...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Syfe</td>
      <td>The app looks fantastic and it’s so fresh with...</td>
      <td>The app looks fantastic and it s so fresh with...</td>
      <td>[(app, n), (looks, v), (fantastic, a), (fresh,...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Syfe</td>
      <td>Hi there,\n\nThe app checks for latest version...</td>
      <td>Hi there The app checks for latest version dur...</td>
      <td>[(Hi, n), (app, n), (checks, n), (latest, a), ...</td>
    </tr>
  </tbody>
</table>
</div>
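
<p>The tag mapping can be tried in isolation. The sketch below assumes the single-letter values that nltk's <code>wordnet.ADJ</code>, <code>wordnet.VERB</code>, <code>wordnet.NOUN</code> and <code>wordnet.ADV</code> constants hold, and uses hand-written (word, tag) pairs in place of real <code>pos_tag</code> output, so it runs without nltk.</p>

```python
# Single-letter codes behind nltk's wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV
pos_dict = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

# Hypothetical (word, Penn Treebank tag) pairs, shaped like pos_tag output
tagged = [('app', 'NN'), ('looks', 'VBZ'), ('fantastic', 'JJ'),
          ('quite', 'RB'), ('two', 'CD')]

# Look up the first letter of each tag; tags outside the four classes give None
mapped = [(word, pos_dict.get(tag[0])) for word, tag in tagged]
print(mapped)
```

<p>The cardinal number tag <code>CD</code> has no entry in the dictionary, so "two" is paired with <code>None</code>, the same way some words in the <code>pos_tagged</code> column above carry <code>None</code>.</p>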

<h2 id="1e-obtaining-the-stem-words">1e. Obtaining the stem words</h2>
<p>A stem is the part of a word responsible for its lexical meaning. The two popular techniques for obtaining root/stem words are stemming and lemmatization.</p>

<p>The key difference is that stemming simply chops characters off the end of a word, so it often produces roots that are not real words. Lemmatization returns meaningful root words, but it needs the POS tag of each word to do so.</p>
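
<p>The contrast can be illustrated with a toy example. Both helpers below are hypothetical stand-ins written for this illustration: <code>crude_stem</code> is a naive suffix stripper rather than a real stemming algorithm, and <code>toy_lemmas</code> is a hand-written lookup table standing in for WordNet.</p>

```python
def crude_stem(word):
    # Toy stemmer: chop off the first matching suffix without checking
    # whether what remains is a real word
    for suffix in ('ies', 'ing', 'ed', 's'):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# Toy lemma table keyed by (word, POS tag); a real lemmatizer consults WordNet
toy_lemmas = {('studies', 'n'): 'study',
              ('checks', 'v'): 'check',
              ('better', 'a'): 'good'}

def toy_lemmatize(word, pos):
    # Fall back to the word itself when the table has no entry
    return toy_lemmas.get((word, pos), word)

pairs = [('studies', 'n'), ('checks', 'v'), ('better', 'a')]
stems = [crude_stem(word) for word, _ in pairs]
lemmas = [toy_lemmatize(word, pos) for word, pos in pairs]
print(stems)
print(lemmas)
```

<p>The stripper turns "studies" into "stud", which is not a word, and leaves "better" untouched, while the POS-aware lookup returns "study" and "good".</p>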

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">nltk.stem</span> <span class="kn">import</span> <span class="n">WordNetLemmatizer</span>

<span class="n">wordnet_lemmatizer</span> <span class="o">=</span> <span class="n">WordNetLemmatizer</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">lemmatize</span><span class="p">(</span><span class="n">pos_data</span><span class="p">):</span>
    <span class="n">lemma_rew</span> <span class="o">=</span> <span class="s">" "</span> <span class="c1"># start from a single space; each lemma is appended to this string
</span>    <span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">pos</span> <span class="ow">in</span> <span class="n">pos_data</span><span class="p">:</span> <span class="c1"># iterate through tuples (word,POS tag)
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">pos</span><span class="p">:</span> 
            <span class="n">lemma</span> <span class="o">=</span> <span class="n">word</span>
            <span class="n">lemma_rew</span> <span class="o">=</span> <span class="n">lemma_rew</span> <span class="o">+</span> <span class="s">" "</span> <span class="o">+</span> <span class="n">lemma</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">lemma</span> <span class="o">=</span> <span class="n">wordnet_lemmatizer</span><span class="p">.</span><span class="n">lemmatize</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">pos</span><span class="o">=</span><span class="n">pos</span><span class="p">)</span>
            <span class="n">lemma_rew</span> <span class="o">=</span> <span class="n">lemma_rew</span> <span class="o">+</span> <span class="s">" "</span> <span class="o">+</span> <span class="n">lemma</span>
    <span class="k">return</span> <span class="n">lemma_rew</span>

<span class="n">app_reviews</span><span class="p">[</span><span class="s">'Lemma'</span><span class="p">]</span> <span class="o">=</span> <span class="n">app_reviews</span><span class="p">[</span><span class="s">'pos_tagged'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">lemmatize</span><span class="p">)</span>
<span class="n">app_reviews</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>app_name</th>
      <th>content</th>
      <th>cleaned_reviews</th>
      <th>pos_tagged</th>
      <th>Lemma</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Syfe</td>
      <td>1. The portfolio “card user interface” can be ...</td>
      <td>The portfolio card user interface can be inco...</td>
      <td>[(portfolio, n), (card, n), (user, None), (int...</td>
      <td>portfolio card user interface inconvenient m...</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Syfe</td>
      <td>This hybrid app is quite buggy compared Stasha...</td>
      <td>This hybrid app is quite buggy compared Stasha...</td>
      <td>[(hybrid, a), (app, n), (quite, r), (buggy, a)...</td>
      <td>hybrid app quite buggy compare Stashaway How...</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Syfe</td>
      <td>The app and website is just a bunch of fake li...</td>
      <td>The app and website is just a bunch of fake li...</td>
      <td>[(app, n), (website, n), (bunch, n), (fake, a)...</td>
      <td>app website bunch fake lie Starting onboardi...</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Syfe</td>
      <td>The app looks fantastic and it’s so fresh with...</td>
      <td>The app looks fantastic and it s so fresh with...</td>
      <td>[(app, n), (looks, v), (fantastic, a), (fresh,...</td>
      <td>app look fantastic fresh different color muc...</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Syfe</td>
      <td>Hi there,\n\nThe app checks for latest version...</td>
      <td>Hi there The app checks for latest version dur...</td>
      <td>[(Hi, n), (app, n), (checks, n), (latest, a), ...</td>
      <td>Hi app check late version launch alert user ...</td>
    </tr>
  </tbody>
</table>
</div>

<h1 id="2-rule-based-sentiment-analysis">2. Rule-Based Sentiment Analysis</h1>

<p>a. TextBlob <br />
b. VADER<br />
c. SentiWordNet</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a new data frame with the app name, cleaned review and Lemma columns
</span><span class="n">fin_data</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">app_reviews</span><span class="p">[[</span><span class="s">'app_name'</span><span class="p">,</span><span class="s">'cleaned_reviews'</span><span class="p">,</span> <span class="s">'Lemma'</span><span class="p">]])</span>
</code></pre></div></div>

<h2 id="2a-sentiment-analysis-using-textblob">2a. Sentiment Analysis using TextBlob</h2>

<ul>
  <li>Polarity – how positive or negative the opinion is</li>
</ul>

<p>Polarity ranges from -1 to 1, where 1 is the most positive, 0 is neutral and -1 is the most negative.</p>

<ul>
  <li>Subjectivity – how subjective the opinion is</li>
</ul>

<p>Subjectivity ranges from 0 to 1, where 0 is very objective and 1 is very subjective.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">textblob</span> <span class="kn">import</span> <span class="n">TextBlob</span>

<span class="c1"># function to calculate subjectivity
</span><span class="k">def</span> <span class="nf">getSubjectivity</span><span class="p">(</span><span class="n">review</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">TextBlob</span><span class="p">(</span><span class="n">review</span><span class="p">).</span><span class="n">sentiment</span><span class="p">.</span><span class="n">subjectivity</span>

<span class="c1"># function to calculate polarity
</span><span class="k">def</span> <span class="nf">getPolarity</span><span class="p">(</span><span class="n">review</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">TextBlob</span><span class="p">(</span><span class="n">review</span><span class="p">).</span><span class="n">sentiment</span><span class="p">.</span><span class="n">polarity</span>

<span class="c1"># function to analyze the reviews
</span><span class="k">def</span> <span class="nf">analysis</span><span class="p">(</span><span class="n">score</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">score</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'Negative'</span>
    <span class="k">elif</span> <span class="n">score</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'Neutral'</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="k">return</span> <span class="s">'Positive'</span>

<span class="c1"># Apply the above functions
</span><span class="n">fin_data</span><span class="p">[</span><span class="s">'subjectivity'</span><span class="p">]</span> <span class="o">=</span> <span class="n">fin_data</span><span class="p">[</span><span class="s">'Lemma'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">getSubjectivity</span><span class="p">)</span> 
<span class="n">fin_data</span><span class="p">[</span><span class="s">'polarity'</span><span class="p">]</span> <span class="o">=</span> <span class="n">fin_data</span><span class="p">[</span><span class="s">'Lemma'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">getPolarity</span><span class="p">)</span> 
<span class="n">fin_data</span><span class="p">[</span><span class="s">'textblob-analysis'</span><span class="p">]</span> <span class="o">=</span> <span class="n">fin_data</span><span class="p">[</span><span class="s">'polarity'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">analysis</span><span class="p">)</span>

<span class="n">fin_data</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>app_name</th>
      <th>cleaned_reviews</th>
      <th>Lemma</th>
      <th>subjectivity</th>
      <th>polarity</th>
      <th>textblob-analysis</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Syfe</td>
      <td>The portfolio card user interface can be inco...</td>
      <td>portfolio card user interface inconvenient m...</td>
      <td>0.436364</td>
      <td>0.236364</td>
      <td>Positive</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Syfe</td>
      <td>This hybrid app is quite buggy compared Stasha...</td>
      <td>hybrid app quite buggy compare Stashaway How...</td>
      <td>0.500000</td>
      <td>0.200000</td>
      <td>Positive</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Syfe</td>
      <td>The app and website is just a bunch of fake li...</td>
      <td>app website bunch fake lie Starting onboardi...</td>
      <td>0.465833</td>
      <td>-0.125000</td>
      <td>Negative</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Syfe</td>
      <td>The app looks fantastic and it s so fresh with...</td>
      <td>app look fantastic fresh different color muc...</td>
      <td>0.473333</td>
      <td>0.146667</td>
      <td>Positive</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Syfe</td>
      <td>Hi there The app checks for latest version dur...</td>
      <td>Hi app check late version launch alert user ...</td>
      <td>0.551515</td>
      <td>-0.154545</td>
      <td>Negative</td>
    </tr>
  </tbody>
</table>
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tb_counts</span> <span class="o">=</span> <span class="n">fin_data</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'app_name'</span><span class="p">,</span><span class="s">'textblob-analysis'</span><span class="p">]).</span><span class="n">size</span><span class="p">()</span>
<span class="n">tb_counts</span>

</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app_name   textblob-analysis
Endowus    Negative                5
           Neutral                12
           Positive              193
StashAway  Negative               97
           Neutral               157
           Positive             1401
Syfe       Negative               30
           Neutral                42
           Positive              102
dtype: int64
</code></pre></div></div>
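
<p>The three apps have very different review volumes, so raw counts are hard to compare directly. As a rough sketch, the counts printed above can be converted into per-app percentages. The dictionary below is typed in by hand from that output, and the name <code>tb_shares</code> is introduced here for illustration.</p>

```python
# TextBlob label counts copied from the groupby output above
tb_counts = {
    'Endowus':   {'Negative': 5,  'Neutral': 12,  'Positive': 193},
    'StashAway': {'Negative': 97, 'Neutral': 157, 'Positive': 1401},
    'Syfe':      {'Negative': 30, 'Neutral': 42,  'Positive': 102},
}

tb_shares = {}
for app, counts in tb_counts.items():
    total = sum(counts.values())  # number of reviews for this app
    # Percentage of each label within the app, rounded to one decimal place
    tb_shares[app] = {label: round(100 * n / total, 1) for label, n in counts.items()}

for app, shares in tb_shares.items():
    print(app, shares)
```

<p>On these numbers, about 91.9% of Endowus reviews are labelled positive by TextBlob, compared with about 84.7% for StashAway and 58.6% for Syfe.</p>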

<h2 id="2b-sentiment-analysis-using-vader">2b. Sentiment Analysis using VADER</h2>

<ul>
  <li>positive if compound &gt;= 0.5</li>
  <li>neutral if -0.5 &lt; compound &lt; 0.5</li>
  <li>negative if compound &lt;= -0.5</li>
</ul>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">vaderSentiment.vaderSentiment</span> <span class="kn">import</span> <span class="n">SentimentIntensityAnalyzer</span>
<span class="n">analyzer</span> <span class="o">=</span> <span class="n">SentimentIntensityAnalyzer</span><span class="p">()</span>

<span class="c1"># Return a sentiment label (Positive, Negative or Neutral) for the input text, based on the VADER compound score
</span><span class="k">def</span> <span class="nf">calc_vader_sentiment</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="n">vs</span> <span class="o">=</span> <span class="n">analyzer</span><span class="p">.</span><span class="n">polarity_scores</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">text</span><span class="p">))</span>
    <span class="n">compound</span> <span class="o">=</span> <span class="n">vs</span><span class="p">[</span><span class="s">'compound'</span><span class="p">]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">compound</span> <span class="o">&gt;=</span> <span class="mf">0.5</span><span class="p">):</span>
        <span class="n">sentiment</span> <span class="o">=</span> <span class="s">'Positive'</span>
    <span class="k">elif</span><span class="p">(</span><span class="n">compound</span> <span class="o">&lt;=</span> <span class="o">-</span><span class="mf">0.5</span><span class="p">):</span>
        <span class="n">sentiment</span> <span class="o">=</span> <span class="s">'Negative'</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">sentiment</span> <span class="o">=</span> <span class="s">'Neutral'</span>
    <span class="k">return</span> <span class="n">sentiment</span>

<span class="n">fin_data</span><span class="p">[</span><span class="s">'vader-analysis'</span><span class="p">]</span> <span class="o">=</span> <span class="n">fin_data</span><span class="p">[</span><span class="s">'Lemma'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">calc_vader_sentiment</span><span class="p">)</span>
<span class="n">fin_data</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>app_name</th>
      <th>cleaned_reviews</th>
      <th>Lemma</th>
      <th>subjectivity</th>
      <th>polarity</th>
      <th>textblob-analysis</th>
      <th>vader-analysis</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Syfe</td>
      <td>The portfolio card user interface can be inco...</td>
      <td>portfolio card user interface inconvenient m...</td>
      <td>0.436364</td>
      <td>0.236364</td>
      <td>Positive</td>
      <td>Positive</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Syfe</td>
      <td>This hybrid app is quite buggy compared Stasha...</td>
      <td>hybrid app quite buggy compare Stashaway How...</td>
      <td>0.500000</td>
      <td>0.200000</td>
      <td>Positive</td>
      <td>Positive</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Syfe</td>
      <td>The app and website is just a bunch of fake li...</td>
      <td>app website bunch fake lie Starting onboardi...</td>
      <td>0.465833</td>
      <td>-0.125000</td>
      <td>Negative</td>
      <td>Negative</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Syfe</td>
      <td>The app looks fantastic and it s so fresh with...</td>
      <td>app look fantastic fresh different color muc...</td>
      <td>0.473333</td>
      <td>0.146667</td>
      <td>Positive</td>
      <td>Positive</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Syfe</td>
      <td>Hi there The app checks for latest version dur...</td>
      <td>Hi app check late version launch alert user ...</td>
      <td>0.551515</td>
      <td>-0.154545</td>
      <td>Negative</td>
      <td>Positive</td>
    </tr>
  </tbody>
</table>
</div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vd_counts</span> <span class="o">=</span> <span class="n">fin_data</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'app_name'</span><span class="p">,</span><span class="s">'vader-analysis'</span><span class="p">]).</span><span class="n">size</span><span class="p">()</span>
<span class="n">vd_counts</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app_name   vader-analysis
Endowus    Neutral             44
           Positive           166
StashAway  Negative            30
           Neutral            466
           Positive          1159
Syfe       Negative            10
           Neutral            103
           Positive            61
dtype: int64
</code></pre></div></div>

<h2 id="2c-sentiment-analysis-using-sentiwordnet">2c. Sentiment Analysis using SentiWordNet</h2>

<ul>
  <li>if positive score &gt; negative score, the sentiment is positive</li>
  <li>if positive score &lt; negative score, the sentiment is negative</li>
  <li>if positive score = negative score, the sentiment is neutral</li>
</ul>
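
<p>A small worked example of the rule, assuming the per-token scores are combined by summing positive minus negative. The score pairs are invented for illustration and do not come from SentiWordNet.</p>

```python
# Hypothetical (positive score, negative score) pairs, one per scored token
token_scores = [(0.625, 0.0), (0.0, 0.25), (0.125, 0.125)]

# Net sentiment: sum of (positive - negative) over the tokens
net = sum(pos - neg for pos, neg in token_scores)

# Positive total outweighs negative total exactly when the net is above zero
if net > 0:
    label = 'Positive'
elif net == 0:
    label = 'Neutral'
else:
    label = 'Negative'

print(net, label)
```

<p>Here the positive total is 0.75 and the negative total is 0.375, so the net is 0.375 and the review is labelled positive.</p>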

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Download sentiwordnet resource if unavailable
# nltk.download('sentiwordnet')
</span><span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">sentiwordnet</span> <span class="k">as</span> <span class="n">swn</span>

<span class="k">def</span> <span class="nf">sentiwordnetanalysis</span><span class="p">(</span><span class="n">pos_data</span><span class="p">):</span>
    <span class="n">sentiment</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">tokens_count</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">for</span> <span class="n">word</span><span class="p">,</span> <span class="n">pos</span> <span class="ow">in</span> <span class="n">pos_data</span><span class="p">:</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">pos</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="n">lemma</span> <span class="o">=</span> <span class="n">wordnet_lemmatizer</span><span class="p">.</span><span class="n">lemmatize</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">pos</span><span class="o">=</span><span class="n">pos</span><span class="p">)</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">lemma</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="n">synsets</span> <span class="o">=</span> <span class="n">wordnet</span><span class="p">.</span><span class="n">synsets</span><span class="p">(</span><span class="n">lemma</span><span class="p">,</span> <span class="n">pos</span><span class="o">=</span><span class="n">pos</span><span class="p">)</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">synsets</span><span class="p">:</span>
            <span class="k">continue</span>
        <span class="c1"># Take the first sense, the most common
</span>        <span class="n">synset</span> <span class="o">=</span> <span class="n">synsets</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">swn_synset</span> <span class="o">=</span> <span class="n">swn</span><span class="p">.</span><span class="n">senti_synset</span><span class="p">(</span><span class="n">synset</span><span class="p">.</span><span class="n">name</span><span class="p">())</span>
        <span class="n">sentiment</span> <span class="o">+=</span> <span class="n">swn_synset</span><span class="p">.</span><span class="n">pos_score</span><span class="p">()</span> <span class="o">-</span> <span class="n">swn_synset</span><span class="p">.</span><span class="n">neg_score</span><span class="p">()</span>
        <span class="n">tokens_count</span> <span class="o">+=</span> <span class="mi">1</span>
        <span class="c1"># print(swn_synset.pos_score(),swn_synset.neg_score(),swn_synset.obj_score())
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">tokens_count</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">0</span>
        <span class="k">if</span> <span class="n">sentiment</span><span class="o">&gt;</span><span class="mi">0</span><span class="p">:</span>
            <span class="k">return</span> <span class="s">"Positive"</span>
        <span class="k">if</span> <span class="n">sentiment</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span>
            <span class="k">return</span> <span class="s">"Neutral"</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="s">"Negative"</span>

<span class="n">fin_data</span><span class="p">[</span><span class="s">'swn-analysis'</span><span class="p">]</span> <span class="o">=</span> <span class="n">app_reviews</span><span class="p">[</span><span class="s">'pos_tagged'</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">sentiwordnetanalysis</span><span class="p">)</span>
<span class="n">fin_data</span><span class="p">.</span><span class="n">head</span><span class="p">()</span>
</code></pre></div></div>

<div>
<style scoped="">
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>app_name</th>
      <th>cleaned_reviews</th>
      <th>Lemma</th>
      <th>subjectivity</th>
      <th>polarity</th>
      <th>textblob-analysis</th>
      <th>vader-analysis</th>
      <th>swn-analysis</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Syfe</td>
      <td>The portfolio card user interface can be inco...</td>
      <td>portfolio card user interface inconvenient m...</td>
      <td>0.436364</td>
      <td>0.236364</td>
      <td>Positive</td>
      <td>Positive</td>
      <td>Neutral</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Syfe</td>
      <td>This hybrid app is quite buggy compared Stasha...</td>
      <td>hybrid app quite buggy compare Stashaway How...</td>
      <td>0.500000</td>
      <td>0.200000</td>
      <td>Positive</td>
      <td>Positive</td>
      <td>Neutral</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Syfe</td>
      <td>The app and website is just a bunch of fake li...</td>
      <td>app website bunch fake lie Starting onboardi...</td>
      <td>0.465833</td>
      <td>-0.125000</td>
      <td>Negative</td>
      <td>Negative</td>
      <td>Neutral</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Syfe</td>
      <td>The app looks fantastic and it s so fresh with...</td>
      <td>app look fantastic fresh different color muc...</td>
      <td>0.473333</td>
      <td>0.146667</td>
      <td>Positive</td>
      <td>Positive</td>
      <td>Neutral</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Syfe</td>
      <td>Hi there The app checks for latest version dur...</td>
      <td>Hi app check late version launch alert user ...</td>
      <td>0.551515</td>
      <td>-0.154545</td>
      <td>Negative</td>
      <td>Positive</td>
      <td>Neutral</td>
    </tr>
  </tbody>
</table>
</div>
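<p>In <code>sentiwordnetanalysis</code> as written, the <code>return</code> statements sit inside the <code>for</code> loop, so the label is decided by the first token that has a synset. This may help explain why so many reviews come out as Neutral. The sketch below shows a variant that sums the scores over every token before choosing a label. It assumes precomputed (positive, negative) score pairs instead of calling NLTK, and the name <code>label_review</code> is hypothetical.</p>

```python
from typing import Iterable, Optional, Tuple


def label_review(scores: Iterable[Tuple[float, float]]) -> Optional[str]:
    """Label a review from per-token (pos_score, neg_score) pairs.

    Hypothetical variant: scores are accumulated over all tokens and
    the label is chosen only after the loop has finished.
    """
    sentiment = 0.0
    tokens_count = 0
    for pos_score, neg_score in scores:
        sentiment += pos_score - neg_score
        tokens_count += 1
    if tokens_count == 0:
        return None  # no token could be scored
    if sentiment > 0:
        return "Positive"
    if sentiment == 0:
        return "Neutral"
    return "Negative"


# First token is objective (0, 0); the later tokens are negative overall.
print(label_review([(0.0, 0.0), (0.0, 0.5), (0.125, 0.25)]))  # Negative
```

<p>With the early return, the same review would be labelled Neutral, because its first scored token has equal positive and negative scores.</p>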

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">swn_counts</span> <span class="o">=</span> <span class="n">fin_data</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'app_name'</span><span class="p">,</span><span class="s">'swn-analysis'</span><span class="p">]).</span><span class="n">size</span><span class="p">()</span>
<span class="n">swn_counts</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>app_name   swn-analysis
Endowus    Negative         12
           Neutral         134
           Positive         63
StashAway  Negative        130
           Neutral         983
           Positive        511
Syfe       Negative         19
           Neutral         117
           Positive         37
dtype: int64
</code></pre></div></div>

<h1 id="3-visualise-results">3. Visualise Results</h1>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Convert sentiment results from series into dataframes
</span><span class="n">tb_counts_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">tb_counts</span><span class="p">).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span><span class="s">'count'</span><span class="p">})</span>
<span class="n">vd_counts_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">vd_counts</span><span class="p">).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span><span class="s">'count'</span><span class="p">})</span>
<span class="n">swn_counts_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">swn_counts</span><span class="p">).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="mi">0</span><span class="p">:</span><span class="s">'count'</span><span class="p">})</span>

</code></pre></div></div>

<h3 id="absolute-comparison">Absolute Comparison</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span> <span class="s">'darkgrid'</span> <span class="p">)</span>
<span class="n">col</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"Set2"</span><span class="p">)</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="n">figsize</span><span class="o">=</span><span class="p">[</span><span class="mi">30</span><span class="p">,</span><span class="mi">8</span><span class="p">])</span>
<span class="n">fig</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'Rule Based Sentiment Analysis on Syfe, Endowus and StashAway'</span><span class="p">)</span>

<span class="c1">## Plot 1
</span><span class="n">sns</span><span class="p">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="n">data</span><span class="o">=</span><span class="n">tb_counts_df</span><span class="p">,</span><span class="n">x</span><span class="o">=</span><span class="s">'app_name'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'count'</span><span class="p">,</span><span class="n">hue</span><span class="o">=</span><span class="s">'textblob-analysis'</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="n">col</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Sentiment Analysis using TextBlob'</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Score Count'</span><span class="p">)</span>

<span class="c1">## Plot 2
</span><span class="n">sns</span><span class="p">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="n">data</span><span class="o">=</span><span class="n">vd_counts_df</span><span class="p">,</span><span class="n">x</span><span class="o">=</span><span class="s">'app_name'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'count'</span><span class="p">,</span><span class="n">hue</span><span class="o">=</span><span class="s">'vader-analysis'</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="n">col</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Sentiment Analysis using VADER'</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Score Count'</span><span class="p">)</span>

<span class="c1">## Plot 3
</span><span class="n">sns</span><span class="p">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span><span class="n">data</span><span class="o">=</span><span class="n">swn_counts_df</span><span class="p">,</span><span class="n">x</span><span class="o">=</span><span class="s">'app_name'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'count'</span><span class="p">,</span><span class="n">hue</span><span class="o">=</span><span class="s">'swn-analysis'</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="n">col</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Sentiment Analysis using SentiWordNet'</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Score Count'</span><span class="p">)</span>
</code></pre></div></div>


<p><img src="/assets/images/multiple-rule-based-sentiment-analysis/multiple-rule-based-sentiment-analysis_26_1.png" alt="Comparing Sentiment Analysis" /></p>

<h3 id="percentage-comparison">Percentage Comparison</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compute each sentiment label's percentage share of its app's total count
</span><span class="n">tb_grouped_df</span> <span class="o">=</span> <span class="n">tb_counts_df</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'app_name'</span><span class="p">,</span><span class="n">tb_counts_df</span><span class="p">[</span><span class="s">'textblob-analysis'</span><span class="p">]]).</span><span class="n">agg</span><span class="p">({</span><span class="s">'count'</span><span class="p">:</span><span class="s">'sum'</span><span class="p">})</span>
<span class="n">tb_percent_df</span> <span class="o">=</span> <span class="n">tb_grouped_df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mi">100</span><span class="o">*</span><span class="n">x</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">()))</span>
<span class="n">tb_percent_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">tb_percent_df</span><span class="p">).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'count'</span><span class="p">:</span><span class="s">'perc_count'</span><span class="p">})</span>

<span class="n">vd_grouped_df</span> <span class="o">=</span> <span class="n">vd_counts_df</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'app_name'</span><span class="p">,</span><span class="n">vd_counts_df</span><span class="p">[</span><span class="s">'vader-analysis'</span><span class="p">]]).</span><span class="n">agg</span><span class="p">({</span><span class="s">'count'</span><span class="p">:</span><span class="s">'sum'</span><span class="p">})</span>
<span class="n">vd_percent_df</span> <span class="o">=</span> <span class="n">vd_grouped_df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mi">100</span><span class="o">*</span><span class="n">x</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">()))</span>
<span class="n">vd_percent_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">vd_percent_df</span><span class="p">).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'count'</span><span class="p">:</span><span class="s">'perc_count'</span><span class="p">})</span>


<span class="n">swn_grouped_df</span> <span class="o">=</span> <span class="n">swn_counts_df</span><span class="p">.</span><span class="n">groupby</span><span class="p">([</span><span class="s">'app_name'</span><span class="p">,</span><span class="n">swn_counts_df</span><span class="p">[</span><span class="s">'swn-analysis'</span><span class="p">]]).</span><span class="n">agg</span><span class="p">({</span><span class="s">'count'</span><span class="p">:</span><span class="s">'sum'</span><span class="p">})</span>
<span class="n">swn_percent_df</span> <span class="o">=</span> <span class="n">swn_grouped_df</span><span class="p">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">).</span><span class="nb">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="mi">100</span><span class="o">*</span><span class="n">x</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="nb">sum</span><span class="p">()))</span>
<span class="n">swn_percent_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">swn_percent_df</span><span class="p">).</span><span class="n">reset_index</span><span class="p">().</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'count'</span><span class="p">:</span><span class="s">'perc_count'</span><span class="p">})</span>

</code></pre></div></div>
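<p>The percentage step divides each label's count by the total for its app. Below is a plain-Python sketch of that arithmetic, using the SentiWordNet counts printed earlier. The helper name <code>percent_by_app</code> is hypothetical, and the sketch deliberately avoids pandas so the calculation is easy to follow.</p>

```python
from collections import defaultdict

# (app_name, label) -> count, copied from the swn_counts output above
swn_counts = {
    ("Endowus", "Negative"): 12,
    ("Endowus", "Neutral"): 134,
    ("Endowus", "Positive"): 63,
    ("StashAway", "Negative"): 130,
    ("StashAway", "Neutral"): 983,
    ("StashAway", "Positive"): 511,
    ("Syfe", "Negative"): 19,
    ("Syfe", "Neutral"): 117,
    ("Syfe", "Positive"): 37,
}


def percent_by_app(counts):
    """Return {(app, label): percentage of that app's total count}."""
    totals = defaultdict(int)
    for (app, _label), n in counts.items():
        totals[app] += n
    return {
        (app, label): 100 * n / totals[app]
        for (app, label), n in counts.items()
    }


swn_percent = percent_by_app(swn_counts)
for (app, label), pct in swn_percent.items():
    print(f"{app:10s} {label:9s} {pct:5.1f}")
```

<p>Within each app the percentages sum to 100, which is a quick sanity check to run on the pandas result as well.</p>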

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span> <span class="s">'darkgrid'</span> <span class="p">)</span>
<span class="n">col</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">color_palette</span><span class="p">(</span><span class="s">"Set2"</span><span class="p">)</span>
<span class="n">fig1</span><span class="p">,</span> <span class="n">axes1</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="n">figsize</span><span class="o">=</span><span class="p">[</span><span class="mi">30</span><span class="p">,</span><span class="mi">8</span><span class="p">])</span>
<span class="n">fig1</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'Rule Based Sentiment Analysis on Syfe, Endowus and StashAway'</span><span class="p">)</span>

<span class="c1">## Plot 1
</span><span class="n">sns</span><span class="p">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axes1</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="n">data</span><span class="o">=</span><span class="n">tb_percent_df</span><span class="p">,</span><span class="n">x</span><span class="o">=</span><span class="s">'app_name'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'perc_count'</span><span class="p">,</span><span class="n">hue</span><span class="o">=</span><span class="s">'textblob-analysis'</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="n">col</span><span class="p">)</span>
<span class="n">axes1</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Sentiment Analysis using TextBlob'</span><span class="p">)</span>
<span class="n">axes1</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Score Count (%)'</span><span class="p">)</span>

<span class="c1">## Plot 2
</span><span class="n">sns</span><span class="p">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axes1</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span><span class="n">data</span><span class="o">=</span><span class="n">vd_percent_df</span><span class="p">,</span><span class="n">x</span><span class="o">=</span><span class="s">'app_name'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'perc_count'</span><span class="p">,</span><span class="n">hue</span><span class="o">=</span><span class="s">'vader-analysis'</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="n">col</span><span class="p">)</span>
<span class="n">axes1</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Sentiment Analysis using VADER'</span><span class="p">)</span>
<span class="n">axes1</span><span class="p">[</span><span class="mi">1</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Score Count (%)'</span><span class="p">)</span>

<span class="c1">## Plot 3
</span><span class="n">sns</span><span class="p">.</span><span class="n">barplot</span><span class="p">(</span><span class="n">ax</span><span class="o">=</span><span class="n">axes1</span><span class="p">[</span><span class="mi">2</span><span class="p">],</span><span class="n">data</span><span class="o">=</span><span class="n">swn_percent_df</span><span class="p">,</span><span class="n">x</span><span class="o">=</span><span class="s">'app_name'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'perc_count'</span><span class="p">,</span><span class="n">hue</span><span class="o">=</span><span class="s">'swn-analysis'</span><span class="p">,</span> <span class="n">palette</span><span class="o">=</span><span class="n">col</span><span class="p">)</span>
<span class="n">axes1</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Sentiment Analysis using SentiWordNet'</span><span class="p">)</span>
<span class="n">axes1</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Score Count (%)'</span><span class="p">)</span>
</code></pre></div></div>


<p><img src="/assets/images/multiple-rule-based-sentiment-analysis/multiple-rule-based-sentiment-analysis_29_1.png" alt="Comparing Percentage Sentiment Analysis" /></p>

<h1 id="key-takeaways">Key Takeaways</h1>
<ul>
  <li>In absolute terms, StashAway has the highest number of scores because it has the most app reviews. Its positive scores far outnumber its negative scores.</li>
  <li>SentiWordNet appears to compress the spread of scores, pulling more of them towards neutral. For StashAway, positive labels fall from 1159 under VADER to 511 under SentiWordNet, while neutral labels rise from 466 to 983.</li>
  <li>In percentage terms, Endowus leads, with a higher share of positive reviews than StashAway or Syfe.</li>
  <li>For all 3 roboadvisors, positive scores make up a larger percentage than negative scores, and negative reviews account for only a small share.</li>
  <li>Hierarchy of choice: Endowus or StashAway &gt; Syfe</li>
  <li>Anyone choosing one of these roboadvisors can rest assured that all 3 apps have garnered good reviews from their users.</li>
</ul>]]></content><author><name>Blog Author</name></author><category term="Data Analytics" /><category term="Sentiment Analysis" /><category term="Roboadvisors" /><summary type="html"><![CDATA[Using a rule-based approach to conduct sentiment analysis on popular roboadvisors in Singapore]]></summary></entry><entry><title type="html">Scraping App Reviews for popular roboadvisors in Singapore using Python</title><link href="http://www.layonsan.com/app-review-scrap/" rel="alternate" type="text/html" title="Scraping App Reviews for popular roboadvisors in Singapore using Python" /><published>2021-09-02T00:00:00+00:00</published><updated>2021-09-02T00:00:00+00:00</updated><id>http://www.layonsan.com/app-review-scrap</id><content type="html" xml:base="http://www.layonsan.com/app-review-scrap/"><![CDATA[<p>Behind every app lie thousands of user voices. I used Python to scrape reviews for Syfe, Endowus, and StashAway from both the Apple App Store and Google Play. In this three-part series, I walk through collecting reviews from each platform and then bringing them together into one dataset ready for analysis. This work draws on <a href="https://python.plainenglish.io/scraping-app-store-reviews-with-python-90e4117ccdfb">Apple Store Scraper</a> and <a href="https://python.plainenglish.io/scraping-storing-google-play-app-reviews-with-python-5640c933c476">Google Play Store Scraper</a>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Importing relevant libraries
</span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># for scraping app info from App Store
</span><span class="kn">from</span> <span class="nn">itunes_app_scraper.scraper</span> <span class="kn">import</span> <span class="n">AppStoreScraper</span>

<span class="c1"># for scraping app reviews from App Store
</span><span class="kn">from</span> <span class="nn">app_store_scraper</span> <span class="kn">import</span> <span class="n">AppStore</span>

<span class="c1"># for scraping app reviews from the Google Play Store
</span><span class="kn">from</span> <span class="nn">google_play_scraper</span> <span class="kn">import</span> <span class="n">app</span><span class="p">,</span> <span class="n">Sort</span><span class="p">,</span> <span class="n">reviews</span>

<span class="c1"># for pretty printing data structures
</span><span class="kn">from</span> <span class="nn">pprint</span> <span class="kn">import</span> <span class="n">pprint</span>

<span class="c1"># for keeping track of timing
</span><span class="kn">import</span> <span class="nn">datetime</span> <span class="k">as</span> <span class="n">dt</span>
<span class="kn">from</span> <span class="nn">tzlocal</span> <span class="kn">import</span> <span class="n">get_localzone</span>

<span class="c1"># for building in wait times
</span><span class="kn">import</span> <span class="nn">random</span>
<span class="kn">import</span> <span class="nn">time</span>
</code></pre></div></div>

<h1 id="part-1---scrap-reviews-from-apple-store">Part 1 - Scrape reviews from Apple Store</h1>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Read in file containing app names and IDs
</span><span class="n">apple_app_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s">'app_info.xlsx'</span><span class="p">,</span> <span class="n">sheet_name</span><span class="o">=</span><span class="s">'apple'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"""
Printing first few rows of app info in the Excel file:
------------------------------------------------------
</span><span class="si">{</span><span class="n">apple_app_df</span><span class="p">.</span><span class="n">head</span><span class="p">()</span><span class="si">}</span><span class="s">
"""</span><span class="p">)</span>

<span class="c1">## Get list of app names and app IDs
</span><span class="n">apple_app_names</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">apple_app_df</span><span class="p">[</span><span class="s">'app_name'</span><span class="p">])</span>
<span class="n">apple_app_ids</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">apple_app_df</span><span class="p">[</span><span class="s">'iOS_app_id'</span><span class="p">])</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Printing first few rows of app info in the Excel file:
------------------------------------------------------
    app_name                 iOS_app_name  iOS_app_id
0       Syfe           syfe-invest-better  1497156434
1    Endowus  endowus-invest-cpf-srs-cash  1531067679
2  StashAway    stashaway-invest-and-save  1229966330
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Set up App Store Scraper
</span><span class="n">scraper</span> <span class="o">=</span> <span class="n">AppStoreScraper</span><span class="p">()</span>
<span class="n">apple_app_store_list</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">scraper</span><span class="p">.</span><span class="n">get_multiple_app_details</span><span class="p">(</span><span class="n">apple_app_ids</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://itunes.apple.com/lookup?id=1497156434&amp;country=nl&amp;entity=software
https://itunes.apple.com/lookup?id=1531067679&amp;country=nl&amp;entity=software
https://itunes.apple.com/lookup?id=1229966330&amp;country=nl&amp;entity=software
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Converting list into dataframe
</span><span class="n">apple_app_info_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">apple_app_store_list</span><span class="p">)</span>
</code></pre></div></div>

<p>Given that the app details contain no user rating counts, we can leave the iTunes ratings out of our analysis.</p>
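<p>One way to check this before dropping the ratings is to test whether any app actually carries a rating count. The sketch below is only an illustration. It assumes each entry of <code>apple_app_store_list</code> is a dictionary and that rating counts, when present, sit under a key such as <code>userRatingCount</code>. The helper name <code>has_usable_values</code> and the sample records are hypothetical.</p>

```python
def has_usable_values(records, key):
    """Return True if at least one record holds a non-empty, non-zero value for `key`."""
    return any(record.get(key) not in (None, "", 0) for record in records)


# Hypothetical stand-in for apple_app_store_list (one dict per app)
sample_app_details = [
    {"trackName": "App A", "userRatingCount": None},
    {"trackName": "App B", "userRatingCount": 0},
    {"trackName": "App C"},  # key missing altogether
]

print(has_usable_values(sample_app_details, "userRatingCount"))  # False
print(has_usable_values(sample_app_details, "trackName"))        # True
```

<p>If the check returns <code>False</code> for the rating-count key, there is nothing to weight the ratings by, which supports leaving them out.</p>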

<h2 id="scrapping-app-reviews-from-apple-store">Scraping App Reviews from the Apple App Store</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Empty list for storing reviews
</span><span class="n">apple_app_reviews</span> <span class="o">=</span> <span class="p">[]</span>

<span class="c1">## Set up loop to go through all apps
</span><span class="k">for</span> <span class="n">app_name</span><span class="p">,</span> <span class="n">app_id</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">apple_app_names</span><span class="p">,</span> <span class="n">apple_app_ids</span><span class="p">):</span>
    
    <span class="c1"># Get start time
</span>    <span class="n">start</span> <span class="o">=</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">tz</span><span class="o">=</span><span class="n">get_localzone</span><span class="p">())</span>
    <span class="n">fmt</span><span class="o">=</span> <span class="s">"%m/%d/%y - %T %p"</span>
    
    <span class="c1"># Print starting output for app
</span>    <span class="k">print</span><span class="p">(</span><span class="s">'---'</span><span class="o">*</span><span class="mi">20</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'---'</span><span class="o">*</span><span class="mi">20</span><span class="p">)</span>    
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'***** </span><span class="si">{</span><span class="n">app_name</span><span class="si">}</span><span class="s"> started at </span><span class="si">{</span><span class="n">start</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="n">fmt</span><span class="p">)</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">()</span>
    
    <span class="c1"># Instantiate AppStore for app
</span>    <span class="n">app_</span> <span class="o">=</span> <span class="n">AppStore</span><span class="p">(</span><span class="n">country</span><span class="o">=</span><span class="s">'sg'</span><span class="p">,</span> <span class="n">app_name</span><span class="o">=</span><span class="n">app_name</span><span class="p">,</span> <span class="n">app_id</span><span class="o">=</span><span class="n">app_id</span><span class="p">)</span>
    
    <span class="c1"># Scrape reviews posted since February 28, 2020 and limit to 10,000 reviews
</span>    <span class="n">app_</span><span class="p">.</span><span class="n">review</span><span class="p">(</span><span class="n">how_many</span><span class="o">=</span><span class="mi">10000</span><span class="p">,</span>
                <span class="n">after</span><span class="o">=</span><span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">(</span><span class="mi">2020</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
                <span class="n">sleep</span><span class="o">=</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span><span class="mi">25</span><span class="p">))</span>
    
    <span class="n">reviews</span> <span class="o">=</span> <span class="n">app_</span><span class="p">.</span><span class="n">reviews</span>
    
    <span class="c1"># Add keys to store information about which app each review is for
</span>    <span class="k">for</span> <span class="n">rvw</span> <span class="ow">in</span> <span class="n">reviews</span><span class="p">:</span>
        <span class="n">rvw</span><span class="p">[</span><span class="s">'app_name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">app_name</span>
        <span class="n">rvw</span><span class="p">[</span><span class="s">'app_id'</span><span class="p">]</span> <span class="o">=</span> <span class="n">app_id</span>
    
    <span class="c1"># Print update that scraping was completed
</span>    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"""Done scraping </span><span class="si">{</span><span class="n">app_name</span><span class="si">}</span><span class="s">. 
    Scraped a total of </span><span class="si">{</span><span class="n">app_</span><span class="p">.</span><span class="n">reviews_count</span><span class="si">}</span><span class="s"> reviews.</span><span class="se">\n</span><span class="s">"""</span><span class="p">)</span>

    <span class="c1"># Convert list of dicts to Pandas DataFrame
</span>    <span class="n">review_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">reviews</span><span class="p">)</span>
    <span class="n">apple_app_reviews</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">review_df</span><span class="p">)</span>
    
    <span class="c1"># Get end time
</span>    <span class="n">end</span> <span class="o">=</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">tz</span><span class="o">=</span><span class="n">get_localzone</span><span class="p">())</span>
    
    <span class="c1"># Print ending output for app
</span>    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"""Successfully wrote </span><span class="si">{</span><span class="n">app_name</span><span class="si">}</span><span class="s"> reviews to df
    at </span><span class="si">{</span><span class="n">end</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="n">fmt</span><span class="p">)</span><span class="si">}</span><span class="s">.</span><span class="se">\n</span><span class="s">"""</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Time elapsed for </span><span class="si">{</span><span class="n">app_name</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">end</span><span class="o">-</span><span class="n">start</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'---'</span><span class="o">*</span><span class="mi">20</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'---'</span><span class="o">*</span><span class="mi">20</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
    
    <span class="c1"># Wait 5 to 10 seconds to start scraping next app
</span>    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------------------------------------------------------------
------------------------------------------------------------
***** Syfe started at 10/03/21 - 16:20:08 PM



2021-10-03 16:20:09,532 [INFO] Base - Initialised: AppStore('sg', 'syfe', 1497156434)
2021-10-03 16:20:09,534 [INFO] Base - Ready to fetch reviews from: https://apps.apple.com/sg/app/syfe/id1497156434
2021-10-03 16:20:30,811 [INFO] Base - [id:1497156434] Fetched 20 reviews (20 fetched in total)
2021-10-03 16:21:13,458 [INFO] Base - [id:1497156434] Fetched 57 reviews (57 fetched in total)
2021-10-03 16:21:13,755 [INFO] Base - [id:1497156434] Fetched 67 reviews (67 fetched in total)


Done scraping Syfe. 
    Scraped a total of 67 reviews.

Successfully wrote Syfe reviews to df
    at 10/03/21 - 16:21:13 PM.

Time elapsed for Syfe: 0:01:05.326312
------------------------------------------------------------
------------------------------------------------------------


------------------------------------------------------------
------------------------------------------------------------
***** Endowus started at 10/03/21 - 16:21:21 PM



2021-10-03 16:21:23,187 [INFO] Base - Initialised: AppStore('sg', 'endowus', 1531067679)
2021-10-03 16:21:23,188 [INFO] Base - Ready to fetch reviews from: https://apps.apple.com/sg/app/endowus/id1531067679
2021-10-03 16:21:44,489 [INFO] Base - [id:1531067679] Fetched 20 reviews (20 fetched in total)
2021-10-03 16:22:27,085 [INFO] Base - [id:1531067679] Fetched 60 reviews (60 fetched in total)
2021-10-03 16:22:27,500 [INFO] Base - [id:1531067679] Fetched 74 reviews (74 fetched in total)


Done scraping Endowus. 
    Scraped a total of 74 reviews.

Successfully wrote Endowus reviews to df
    at 10/03/21 - 16:22:27 PM.

Time elapsed for Endowus: 0:01:05.738188
------------------------------------------------------------
------------------------------------------------------------


------------------------------------------------------------
------------------------------------------------------------
***** StashAway started at 10/03/21 - 16:22:35 PM



2021-10-03 16:22:36,806 [INFO] Base - Initialised: AppStore('sg', 'stashaway', 1229966330)
2021-10-03 16:22:36,807 [INFO] Base - Ready to fetch reviews from: https://apps.apple.com/sg/app/stashaway/id1229966330
2021-10-03 16:22:59,209 [INFO] Base - [id:1229966330] Fetched 17 reviews (17 fetched in total)
2021-10-03 16:23:43,855 [INFO] Base - [id:1229966330] Fetched 50 reviews (50 fetched in total)
2021-10-03 16:24:28,497 [INFO] Base - [id:1229966330] Fetched 82 reviews (82 fetched in total)
2021-10-03 16:25:13,169 [INFO] Base - [id:1229966330] Fetched 119 reviews (119 fetched in total)
2021-10-03 16:25:57,920 [INFO] Base - [id:1229966330] Fetched 148 reviews (148 fetched in total)
2021-10-03 16:26:42,669 [INFO] Base - [id:1229966330] Fetched 180 reviews (180 fetched in total)
2021-10-03 16:27:27,424 [INFO] Base - [id:1229966330] Fetched 213 reviews (213 fetched in total)
2021-10-03 16:28:12,182 [INFO] Base - [id:1229966330] Fetched 244 reviews (244 fetched in total)
2021-10-03 16:28:56,825 [INFO] Base - [id:1229966330] Fetched 273 reviews (273 fetched in total)
2021-10-03 16:29:41,572 [INFO] Base - [id:1229966330] Fetched 309 reviews (309 fetched in total)
2021-10-03 16:30:26,278 [INFO] Base - [id:1229966330] Fetched 337 reviews (337 fetched in total)
2021-10-03 16:31:10,974 [INFO] Base - [id:1229966330] Fetched 366 reviews (366 fetched in total)
2021-10-03 16:31:55,623 [INFO] Base - [id:1229966330] Fetched 391 reviews (391 fetched in total)
2021-10-03 16:31:55,944 [INFO] Base - [id:1229966330] Fetched 391 reviews (391 fetched in total)


Done scraping StashAway. 
    Scraped a total of 391 reviews.

Successfully wrote StashAway reviews to df
    at 10/03/21 - 16:31:55 PM.

Time elapsed for StashAway: 0:09:20.442317
------------------------------------------------------------
------------------------------------------------------------
</code></pre></div></div>
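<p>The timestamps in the output above read like <code>16:20:08 PM</code> because the format string pairs a 24-hour time (<code>%T</code>) with an AM/PM marker (<code>%p</code>). The sketch below shows two more conventional alternatives using only the standard library. The variable names <code>fmt_24h</code> and <code>fmt_12h</code> are illustrative, and the fixed timestamp is chosen to match the first log line rather than taken from the notebook.</p>

```python
import datetime as dt

# Fixed timestamp, chosen to match the first "started at" line above
start = dt.datetime(2021, 10, 3, 16, 20, 8)

fmt_24h = "%m/%d/%y - %H:%M:%S"     # 24-hour clock, no AM/PM marker
fmt_12h = "%m/%d/%y - %I:%M:%S %p"  # 12-hour clock with AM/PM marker

print(start.strftime(fmt_24h))  # 10/03/21 - 16:20:08
print(start.strftime(fmt_12h))  # AM/PM text depends on the locale
```

<p>Spelling the time out as <code>%H:%M:%S</code> also avoids relying on <code>%T</code>, which is not among the format codes Python documents as available on every platform.</p>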

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Combine the per-app DataFrames into a single DataFrame
</span><span class="n">apple_reviews</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="n">apple_app_reviews</span><span class="p">)</span>
</code></pre></div></div>

<h1 id="part-2---scrap-reviews-from-google-play-store">Part 2 - Scrape reviews from Google Play Store</h1>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Load the app data (app names and app IDs) from the 'google' sheet
</span><span class="n">google_app_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s">'app_info.xlsx'</span><span class="p">,</span><span class="n">sheet_name</span><span class="o">=</span><span class="s">'google'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">google_app_df</span><span class="p">.</span><span class="n">head</span><span class="p">())</span>

<span class="c1">## Get list of app names and app IDs
</span><span class="n">google_app_names</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">google_app_df</span><span class="p">[</span><span class="s">'app_name'</span><span class="p">])</span>
<span class="n">google_app_ids</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">google_app_df</span><span class="p">[</span><span class="s">'app_id'</span><span class="p">])</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    app_name                 app_id
0       Syfe               com.syfe
1    Endowus  com.endowus.mobileapp
2  StashAway      com.awp.stashaway
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Loop through app IDs to get app info
</span><span class="n">google_app_info</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="n">google_app_ids</span><span class="p">:</span>
    <span class="n">info</span> <span class="o">=</span> <span class="n">app</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
    <span class="k">del</span> <span class="n">info</span><span class="p">[</span><span class="s">'comments'</span><span class="p">]</span>
    <span class="n">google_app_info</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">info</span><span class="p">)</span>

<span class="c1">## Pretty print the data for the first app
</span><span class="n">pprint</span><span class="p">(</span><span class="n">google_app_info</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>

<span class="n">google_app_infos</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">google_app_info</span><span class="p">)</span>
<span class="c1"># google_app_infos.to_csv('apps.csv', index=None, header=True)
# google_app_infos
</span></code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{'adSupported': None,
 'androidVersion': None,
 'androidVersionText': None,
 'appId': 'com.syfe',
 'containsAds': False,
 'contentRating': None,
 'contentRatingDescription': None,
 'currency': None,
 'description': None,
 'descriptionHTML': None,
 'developer': None,
 'developerAddress': None,
 'developerEmail': None,
 'developerId': None,
 'developerInternalID': None,
 'developerWebsite': None,
 'editorsChoice': False,
 'free': None,
 'genre': None,
 'genreId': None,
 'headerImage': None,
 'histogram': [0, 0, 0, 0, 0],
 'icon': None,
 'inAppProductPrice': None,
 'installs': None,
 'minInstalls': None,
 'offersIAP': False,
 'originalPrice': None,
 'price': None,
 'privacyPolicy': None,
 'ratings': None,
 'recentChanges': None,
 'recentChangesHTML': None,
 'released': None,
 'reviews': None,
 'sale': False,
 'saleText': None,
 'saleTime': None,
 'score': None,
 'screenshots': [],
 'size': None,
 'summary': None,
 'summaryHTML': None,
 'title': None,
 'updated': None,
 'url': 'https://play.google.com/store/apps/details?id=com.syfe&amp;hl=en&amp;gl=us',
 'version': [None,
             [[[[[None,
                  [None,
                   [[None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/_TcrYZaOKkM12SLSZyKWO4l_QgHSkhvXi1m0tm7OnwyxzAY3YrTUKYSpmhp5QM1gf-zF'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/_TcrYZaOKkM12SLSZyKWO4l_QgHSkhvXi1m0tm7OnwyxzAY3YrTUKYSpmhp5QM1gf-zF'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/_TcrYZaOKkM12SLSZyKWO4l_QgHSkhvXi1m0tm7OnwyxzAY3YrTUKYSpmhp5QM1gf-zF'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2]],
                   2,
                   2],
                  'SGX Mobile',
                  None,
                  [[['SGX',
                     [None,
                      None,
                      None,
                      None,
                      [None, None, '/store/apps/developer?id=SGX']],
                     True]],
                   [None,
                    [None,
                     [None,
                      'Live market data, news and company announcements of all '
                      'SGX-listed companies']]]],
                  [],
                  [[None, None, [None, ['4.1', 4.0926642]]]],
                  [],
                  [5, 4, 5],
                  [None,
                   None,
                   None,
                   None,
                   [None, None, '/store/apps/details?id=com.sgx.SGXandroid']],
                  None,
                  ['CAIaWAoaEhgKEmNvbS5zZ3guU0dYYW5kcm9pZBABGAMQADITCKDuzMPsrfMCFbqhSwUdxdUFmnITCNKM1PjrrfMCFfWESwUdtUoDVIoBDQgAEgkKBWVuLVVTEACqAl0aWwgAEhoKGAoSY29tLnNneC5TR1hhbmRyb2lkEAEYA0oTCKDuzMPsrfMCFbqhSwUdxdUFmpoBEwjSjNT4663zAhX1hEsFHbVKA1T6AQ8KDQgAEgkKBWVuLVVTEAA='],
                  ['com.sgx.SGXandroid', 7]],
                 [None,
                  [None,
                   [[None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/HJUz_In6O2_dQ-cLMju7pt9qq5xFXzp25xCr_P663EFr3f3C2rcraCvNtrIF9YrX5FxI'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/HJUz_In6O2_dQ-cLMju7pt9qq5xFXzp25xCr_P663EFr3f3C2rcraCvNtrIF9YrX5FxI'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/HJUz_In6O2_dQ-cLMju7pt9qq5xFXzp25xCr_P663EFr3f3C2rcraCvNtrIF9YrX5FxI'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2]],
                   2,
                   2],
                  'UOBAM Invest',
                  None,
                  [[['UOB Asset Management Ltd',
                     [None,
                      None,
                      None,
                      None,
                      [None,
                       None,
                       '/store/apps/developer?id=UOB+Asset+Management+Ltd']],
                     True]],
                   [None,
                    [None,
                     [None,
                      'UOBAM Invest is your personal robo-adviser to help you '
                      'build your future wealth.']]]],
                  [],
                  [[None, None, [None, ['3.8', 3.8282828]]]],
                  [],
                  [5, 4, 5],
                  [None,
                   None,
                   None,
                   None,
                   [None,
                    None,
                    '/store/apps/details?id=com.uobam.uobaminvest']],
                  None,
                  ['CAIaWwodEhsKFWNvbS51b2JhbS51b2JhbWludmVzdBABGAMQATITCKDuzMPsrfMCFbqhSwUdxdUFmnITCNKM1PjrrfMCFfWESwUdtUoDVIoBDQgAEgkKBWVuLVNHEACqAmAaXggBEh0KGwoVY29tLnVvYmFtLnVvYmFtaW52ZXN0EAEYA0oTCKDuzMPsrfMCFbqhSwUdxdUFmpoBEwjSjNT4663zAhX1hEsFHbVKA1T6AQ8KDQgAEgkKBWVuLVNHEAA='],
                  ['com.uobam.uobaminvest', 7]],
                 [None,
                  [None,
                   [[None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/BxJeLxjKGNka1wdqF8SF5hXq3gRbDYBDDSJN14T4QwvtsKhqgVgUT4ms9yvtt-O1QPEU'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/BxJeLxjKGNka1wdqF8SF5hXq3gRbDYBDDSJN14T4QwvtsKhqgVgUT4ms9yvtt-O1QPEU'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/BxJeLxjKGNka1wdqF8SF5hXq3gRbDYBDDSJN14T4QwvtsKhqgVgUT4ms9yvtt-O1QPEU'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2]],
                   2,
                   2],
                  'StashAway: Invest and save',
                  None,
                  [[['Asia Wealth Platform Pte Ltd',
                     [None,
                      None,
                      None,
                      None,
                      [None,
                       None,
                       '/store/apps/developer?id=Asia+Wealth+Platform+Pte+Ltd']],
                     True]],
                   [None, [None, [None, 'Personal finance and investing']]]],
                  [],
                  [[None, None, [None, ['4.1', 4.1464176]]]],
                  [],
                  [5, 4, 5],
                  [None,
                   None,
                   None,
                   None,
                   [None, None, '/store/apps/details?id=com.awp.stashaway']],
                  None,
                  ['CAIaVwoZEhcKEWNvbS5hd3Auc3Rhc2hhd2F5EAEYAxACMhMIoO7Mw+yt8wIVuqFLBR3F1QWachMI0ozU+Out8wIV9YRLBR21SgNUigENCAASCQoFZW4tVVMQAKoCXBpaCAISGQoXChFjb20uYXdwLnN0YXNoYXdheRABGANKEwig7szD7K3zAhW6oUsFHcXVBZqaARMI0ozU+Out8wIV9YRLBR21SgNU+gEPCg0IABIJCgVlbi1VUxAA'],
                  ['com.awp.stashaway', 7]],
                 [None,
                  [None,
                   [[None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/8606sVmXHRX7HJTtoS8jSuVS7HVl4BXt-SLqVo7tKNEw4dDMP27KvEcd3d2NXH3hkpE'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/8606sVmXHRX7HJTtoS8jSuVS7HVl4BXt-SLqVo7tKNEw4dDMP27KvEcd3d2NXH3hkpE'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2],
                    [None,
                     2,
                     [512, 512],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/8606sVmXHRX7HJTtoS8jSuVS7HVl4BXt-SLqVo7tKNEw4dDMP27KvEcd3d2NXH3hkpE'],
                     None,
                     None,
                     None,
                     None,
                     None,
                     2]],
                   2,
                   2],
                  'Tiger Trade-Global Invest&amp;Save',
                  None,
                  [[['TIGER BROKERS',
                     [None,
                      None,
                      None,
                      None,
                      [None, None, '/store/apps/developer?id=TIGER+BROKERS']],
                     True]],
                   [None,
                    [None, [None, 'ETF,Options,Futures&amp;amp;Free Quote']]]],
                  [],
                  [[None, None, [None, ['4.5', 4.451025]]]],
                  [],
                  [5, 4, 5],
                  [None,
                   None,
                   None,
                   None,
                   [None,
                    None,
                    '/store/apps/details?id=com.tigerbrokers.stock']],
                  None,
                  ['CAIaXAoeEhwKFmNvbS50aWdlcmJyb2tlcnMuc3RvY2sQARgDEAMyEwig7szD7K3zAhW6oUsFHcXVBZpyEwjSjNT4663zAhX1hEsFHbVKA1SKAQ0IABIJCgVlbi1VUxAAqgJhGl8IAxIeChwKFmNvbS50aWdlcmJyb2tlcnMuc3RvY2sQARgDShMIoO7Mw+yt8wIVuqFLBR3F1QWamgETCNKM1PjrrfMCFfWESwUdtUoDVPoBDwoNCAASCQoFZW4tVVMQAA=='],
                  ['com.tigerbrokers.stock', 7]],
                 [None,
                  [None,
                   [[None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/pHMOIRJ21PkHLMdk1yjQJPsVnyx-CKgdtjd3VOnGb1JY7inJECHe_o7hFljJa8wcHlA']],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/pHMOIRJ21PkHLMdk1yjQJPsVnyx-CKgdtjd3VOnGb1JY7inJECHe_o7hFljJa8wcHlA']],
                    [None,
                     2,
                     [0, 0],
                     [None,
                      None,
                      'https://play-lh.googleusercontent.com/pHMOIRJ21PkHLMdk1yjQJPsVnyx-CKgdtjd3VOnGb1JY7inJECHe_o7hFljJa8wcHlA']]],
                   2,
                   2],
                  'Wahed Invest',
                  None,
                  [[['Wahed Inc.',
                     [None,
                      None,
                      None,
                      None,
                      [None, None, '/store/apps/developer?id=Wahed+Inc.']],
                     True]],
                   [None, [None, [None, 'Ethical Investing Made Simple']]]],
                  [],
                  [[None, None, [None, ['3.7', 3.6796117]]]],
                  [],
                  [5, 4, 5],
                  [None,
                   None,
                   None,
                   None,
                   [None, None, '/store/apps/details?id=com.wahed.mobile']],
                  None,
                  ['CAIaVgoYEhYKEGNvbS53YWhlZC5tb2JpbGUQARgDEAQyEwig7szD7K3zAhW6oUsFHcXVBZpyEwjSjNT4663zAhX1hEsFHbVKA1SKAQ0IABIJCgVlbi1VUxAAqgJbGlkIBBIYChYKEGNvbS53YWhlZC5tb2JpbGUQARgDShMIoO7Mw+yt8wIVuqFLBR3F1QWamgETCNKM1PjrrfMCFfWESwUdtUoDVPoBDwoNCAASCQoFZW4tVVMQAA=='],
                  ['com.wahed.mobile', 7]]],
                'Similar',
                None,
                [None,
                 None,
                 None,
                 None,
                 [None,
                  None,
                  '/store/apps/collection/cluster?clp=ogoWCBEqAggIMg4KCGNvbS5zeWZlEAEYAw%3D%3D:S:ANO1ljIao8I&amp;gsr=ChmiChYIESoCCAgyDgoIY29tLnN5ZmUQARgD:S:ANO1ljJbwVo']],
                True,
                2,
                None,
                [None,
                 'CjWC0_-4Ay8KJvqegZ0DIAgGEKHj-8YKEIqQ4PEDEOLnka0JEMrhhc4KEL_S040MEI-SyKrELxAFGhmiChYIESoCCAgyDgoIY29tLnN5ZmUQARgD'],
                True],
               None,
               None,
               ['CBSqARUKEwiw4c7D7K3zAhW6oUsFHcXVBZo=']]],
             None,
             [],
             True],
 'video': None,
 'videoImage': None}
</code></pre></div></div>
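<p>Most fields in the dictionary above came back as <code>None</code>, so it helps to see at a glance which keys hold usable values. The sketch below is a minimal illustration, not part of the original notebook. The helpers <code>clean_app_info</code> and <code>populated_fields</code> are hypothetical names, and <code>sample_info</code> is a trimmed-down copy of the printed output. It also swaps the <code>del info['comments']</code> step for <code>dict.pop</code> with a default, which does not raise a <code>KeyError</code> when the key is missing.</p>

```python
def clean_app_info(info, drop_keys=("comments",)):
    """Return a copy of `info` without `drop_keys`; missing keys are ignored."""
    cleaned = dict(info)
    for key in drop_keys:
        cleaned.pop(key, None)
    return cleaned


def populated_fields(info):
    """Return the sorted keys whose values are not None, '' or an empty list."""
    return sorted(
        key for key, value in info.items()
        if value is not None and value != "" and value != []
    )


# Hypothetical, trimmed-down copy of the dictionary printed above
sample_info = {
    "appId": "com.syfe",
    "title": None,
    "score": None,
    "installs": None,
    "screenshots": [],
    "histogram": [0, 0, 0, 0, 0],
}

cleaned = clean_app_info(sample_info)  # no 'comments' key here, and no KeyError
print(populated_fields(cleaned))       # ['appId', 'histogram']
```

<p>A very short list of populated fields, as in this sample, is a hint that the app details are not usable as they stand and that only the reviews are worth carrying forward.</p>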

<h2 id="scrapping-google-play-store-reviews">Scraping Google Play Store reviews</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># for scraping app reviews from Google Play Store
</span><span class="kn">from</span> <span class="nn">google_play_scraper</span> <span class="kn">import</span> <span class="n">app</span><span class="p">,</span> <span class="n">Sort</span><span class="p">,</span> <span class="n">reviews</span>

<span class="c1"># Empty list for storing reviews
</span><span class="n">google_app_reviews</span> <span class="o">=</span> <span class="p">[]</span>

<span class="c1">## Loop through apps to get reviews
</span><span class="k">for</span> <span class="n">app_name</span><span class="p">,</span> <span class="n">app_id</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">google_app_names</span><span class="p">,</span> <span class="n">google_app_ids</span><span class="p">):</span>
    
    <span class="c1"># Get start time
</span>    <span class="n">start</span> <span class="o">=</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">tz</span><span class="o">=</span><span class="n">get_localzone</span><span class="p">())</span>
    <span class="n">fmt</span><span class="o">=</span> <span class="s">"%m/%d/%y - %T %p"</span>    
    
    <span class="c1"># Print starting output for app
</span>    <span class="k">print</span><span class="p">(</span><span class="s">'---'</span><span class="o">*</span><span class="mi">20</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'---'</span><span class="o">*</span><span class="mi">20</span><span class="p">)</span>    
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'***** </span><span class="si">{</span><span class="n">app_name</span><span class="si">}</span><span class="s"> started at </span><span class="si">{</span><span class="n">start</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="n">fmt</span><span class="p">)</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">()</span>
    
    <span class="c1"># Number of reviews to scrape per batch
</span>    <span class="n">count</span> <span class="o">=</span> <span class="mi">200</span>
    
    <span class="c1"># To keep track of how many batches have been completed
</span>    <span class="n">batch_num</span> <span class="o">=</span> <span class="mi">0</span>
     
    <span class="c1"># Retrieve reviews (and continuation_token) with reviews function
</span>    <span class="n">rvws</span><span class="p">,</span> <span class="n">token</span> <span class="o">=</span> <span class="n">reviews</span><span class="p">(</span>
        <span class="n">app_id</span><span class="p">,</span>           <span class="c1"># found in app's url
</span>        <span class="n">lang</span><span class="o">=</span><span class="s">'en'</span><span class="p">,</span>        <span class="c1"># defaults to 'en'
</span>        <span class="n">country</span><span class="o">=</span><span class="s">'us'</span><span class="p">,</span>     <span class="c1"># defaults to 'us'
</span>        <span class="n">sort</span><span class="o">=</span><span class="n">Sort</span><span class="p">.</span><span class="n">NEWEST</span><span class="p">,</span> <span class="c1"># start with most recent
</span>        <span class="n">count</span><span class="o">=</span><span class="n">count</span>       <span class="c1"># batch size
</span>    <span class="p">)</span>
    
    <span class="c1"># For each review obtained
</span>    <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">rvws</span><span class="p">:</span>
        <span class="n">r</span><span class="p">[</span><span class="s">'app_name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">app_name</span> <span class="c1"># add key for app's name
</span>        <span class="n">r</span><span class="p">[</span><span class="s">'app_id'</span><span class="p">]</span> <span class="o">=</span> <span class="n">app_id</span>     <span class="c1"># add key for app's id
</span>     
    
    <span class="c1"># Add the list of review dicts to overall list
</span>    <span class="n">google_app_reviews</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">rvws</span><span class="p">)</span>
    
    <span class="c1"># Increase batch count by one
</span>    <span class="n">batch_num</span> <span class="o">+=</span><span class="mi">1</span> 
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Batch </span><span class="si">{</span><span class="n">batch_num</span><span class="si">}</span><span class="s"> completed.'</span><span class="p">)</span>
    
    <span class="c1"># Wait 1 to 5 seconds to start next batch
</span>    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
    
    <span class="c1"># Collect IDs of every review scraped so far (all apps, so the totals printed later are cumulative)
</span>    <span class="n">pre_review_ids</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">rvw</span> <span class="ow">in</span> <span class="n">google_app_reviews</span><span class="p">:</span>
        <span class="n">pre_review_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">rvw</span><span class="p">[</span><span class="s">'reviewId'</span><span class="p">])</span>
    
    <span class="c1"># Loop through at most max number of batches
</span>    <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4999</span><span class="p">):</span>
        <span class="n">rvws</span><span class="p">,</span> <span class="n">token</span> <span class="o">=</span> <span class="n">reviews</span><span class="p">(</span> <span class="c1"># store continuation_token
</span>            <span class="n">app_id</span><span class="p">,</span>
            <span class="n">lang</span><span class="o">=</span><span class="s">'en'</span><span class="p">,</span>
            <span class="n">country</span><span class="o">=</span><span class="s">'us'</span><span class="p">,</span>
            <span class="n">sort</span><span class="o">=</span><span class="n">Sort</span><span class="p">.</span><span class="n">NEWEST</span><span class="p">,</span>
            <span class="n">count</span><span class="o">=</span><span class="n">count</span><span class="p">,</span>
            <span class="c1"># using token obtained from previous batch
</span>            <span class="n">continuation_token</span><span class="o">=</span><span class="n">token</span>
        <span class="p">)</span>
        
        <span class="c1"># Append unique review IDs from current batch to new list
</span>        <span class="n">new_review_ids</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">rvws</span><span class="p">:</span>
            <span class="n">new_review_ids</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="s">'reviewId'</span><span class="p">])</span>
            
            <span class="c1"># And add keys for name and id to each review dict
</span>            <span class="n">r</span><span class="p">[</span><span class="s">'app_name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">app_name</span> <span class="c1"># add key for app's name
</span>            <span class="n">r</span><span class="p">[</span><span class="s">'app_id'</span><span class="p">]</span> <span class="o">=</span> <span class="n">app_id</span>     <span class="c1"># add key for app's id
</span>     
        <span class="c1"># Add the list of review dicts to main app_reviews list
</span>        <span class="n">google_app_reviews</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="n">rvws</span><span class="p">)</span>
        
        <span class="c1"># Increase batch count by one
</span>        <span class="n">batch_num</span> <span class="o">+=</span><span class="mi">1</span>
        
        <span class="c1"># Break loop and stop scraping for current app if most recent batch
</span>          <span class="c1"># did not add any unique reviews
</span>        <span class="n">all_review_ids</span> <span class="o">=</span> <span class="n">pre_review_ids</span> <span class="o">+</span> <span class="n">new_review_ids</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">pre_review_ids</span><span class="p">))</span> <span class="o">==</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">all_review_ids</span><span class="p">)):</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'No reviews left to scrape. Completed </span><span class="si">{</span><span class="n">batch_num</span><span class="si">}</span><span class="s"> batches.</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
            <span class="k">break</span>
        
        <span class="c1"># all_review_ids becomes pre_review_ids to check against 
</span>          <span class="c1"># for next batch
</span>        <span class="n">pre_review_ids</span> <span class="o">=</span> <span class="n">all_review_ids</span>
        
        <span class="c1"># Wait 1 to 5 seconds to start next batch
</span>        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
      
    
    <span class="c1"># Print update when max number of batches has been reached
</span>      <span class="c1"># OR when last batch didn't add any unique reviews
</span>    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Done scraping </span><span class="si">{</span><span class="n">app_name</span><span class="si">}</span><span class="s">.'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Scraped a total of </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">pre_review_ids</span><span class="p">))</span><span class="si">}</span><span class="s"> unique reviews.</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
    
    <span class="c1"># Get end time
</span>    <span class="n">end</span> <span class="o">=</span> <span class="n">dt</span><span class="p">.</span><span class="n">datetime</span><span class="p">.</span><span class="n">now</span><span class="p">(</span><span class="n">tz</span><span class="o">=</span><span class="n">get_localzone</span><span class="p">())</span>
    
    <span class="c1"># Print ending output for app
</span>    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"""
    Successfully inserted all </span><span class="si">{</span><span class="n">app_name</span><span class="si">}</span><span class="s"> reviews into collection
    at </span><span class="si">{</span><span class="n">end</span><span class="p">.</span><span class="n">strftime</span><span class="p">(</span><span class="n">fmt</span><span class="p">)</span><span class="si">}</span><span class="s">.</span><span class="se">\n</span><span class="s">
    """</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Time elapsed for </span><span class="si">{</span><span class="n">app_name</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">end</span><span class="o">-</span><span class="n">start</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'---'</span><span class="o">*</span><span class="mi">20</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'---'</span><span class="o">*</span><span class="mi">20</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
    
    <span class="c1"># Wait 1 to 5 seconds to start scraping next app
</span>    <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">5</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------------------------------------------------------------
------------------------------------------------------------
***** Syfe started at 09/04/21 - 23:33:25 PM

Batch 1 completed.
No reviews left to scrape. Completed 2 batches.

Done scraping Syfe.
Scraped a total of 110 unique reviews.


    Successfully inserted all Syfe reviews into collection
    at 09/04/21 - 23:33:31 PM.

    
Time elapsed for Syfe: 0:00:05.276724
------------------------------------------------------------
------------------------------------------------------------


------------------------------------------------------------
------------------------------------------------------------
***** Endowus started at 09/04/21 - 23:33:32 PM

Batch 1 completed.
No reviews left to scrape. Completed 2 batches.

Done scraping Endowus.
Scraped a total of 250 unique reviews.


    Successfully inserted all Endowus reviews into collection
    at 09/04/21 - 23:33:34 PM.

    
Time elapsed for Endowus: 0:00:02.298619
------------------------------------------------------------
------------------------------------------------------------


------------------------------------------------------------
------------------------------------------------------------
***** StashAway started at 09/04/21 - 23:33:38 PM

Batch 1 completed.
No reviews left to scrape. Completed 8 batches.

Done scraping StashAway.
Scraped a total of 1515 unique reviews.


    Successfully inserted all StashAway reviews into collection
    at 09/04/21 - 23:34:00 PM.

    
Time elapsed for StashAway: 0:00:22.199429
------------------------------------------------------------
------------------------------------------------------------
</code></pre></div></div>
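
<p>The stopping rule used in the loop above (keep paging until a batch adds no unseen review IDs) can be isolated into a small helper. This is only a sketch: <code>scrape_until_no_new</code>, <code>make_fake_fetcher</code> and the sample pages are hypothetical names and data made up for illustration, and the fake fetcher stands in for the real <code>reviews</code> call so the logic can be run offline. Unlike the loop above, it tracks IDs per call, so the count it reports is for one app only.</p>

```python
def scrape_until_no_new(fetch_batch, max_batches=5000):
    """Page through batches until one adds no unseen review IDs.

    fetch_batch is a hypothetical callable that takes a continuation
    token and returns (list_of_review_dicts, next_token).
    """
    seen_ids = set()
    collected = []
    token = None
    for _ in range(max_batches):
        batch, token = fetch_batch(token)
        # Keep only the reviews whose ID has not been seen before
        fresh = [r for r in batch if r["reviewId"] not in seen_ids]
        if not fresh:
            break  # nothing new in this batch, so stop paging
        seen_ids.update(r["reviewId"] for r in fresh)
        collected.extend(fresh)
    return collected


def make_fake_fetcher(pages):
    """Stand-in for a real API: serves pages in order, then repeats the last one."""
    def fetch(token):
        index = 0 if token is None else token
        page = pages[min(index, len(pages) - 1)]
        return page, index + 1
    return fetch


# Made-up sample data: two pages, after which the last page repeats
pages = [
    [{"reviewId": "a"}, {"reviewId": "b"}],
    [{"reviewId": "c"}],
]
reviews_for_app = scrape_until_no_new(make_fake_fetcher(pages))
print(len(reviews_for_app))
```

<p>Using a set for the seen IDs avoids rebuilding and comparing two growing lists on every batch, and filtering before extending keeps duplicate rows out of the collected list.</p>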

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Converting output to dataframe
</span><span class="n">google_reviews</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">google_app_reviews</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="part-3-combining-both-apple-and-google-store-reviews">Part 3: Combining both Apple and Google Store reviews</h2>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Apple Store: </span><span class="si">{</span><span class="n">np</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">apple_reviews</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s"> rows and </span><span class="si">{</span><span class="n">np</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">apple_reviews</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s"> columns.'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Google Play Store: </span><span class="si">{</span><span class="n">np</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">google_reviews</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="si">}</span><span class="s"> rows and </span><span class="si">{</span><span class="n">np</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">google_reviews</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s"> columns.'</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Apple Store: 524 rows and 9 columns.
Google Play Store: 1515 rows and 12 columns.
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">new_apple</span> <span class="o">=</span> <span class="n">apple_reviews</span><span class="p">[[</span><span class="s">'app_name'</span><span class="p">,</span><span class="s">'review'</span><span class="p">]]</span> <span class="c1"># Selecting app_name and review from apple reviews into new df
</span><span class="n">new_apple</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'review'</span><span class="p">:</span><span class="s">'content'</span><span class="p">},</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="c1"># rename review content column
</span><span class="n">new_google</span> <span class="o">=</span> <span class="n">google_reviews</span><span class="p">[[</span><span class="s">'app_name'</span><span class="p">,</span><span class="s">'content'</span><span class="p">]]</span> <span class="c1"># subset app_name and content from google reviews into new df
</span><span class="n">total_reviews</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">new_apple</span><span class="p">,</span><span class="n">new_google</span><span class="p">])</span> <span class="c1"># Concat both dfs into one
</span>
<span class="c1"># saving reviews into csv file
</span><span class="n">total_reviews</span><span class="p">.</span><span class="n">to_csv</span><span class="p">(</span><span class="s">'app_reviews.csv'</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
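
<p>Calling <code>rename</code> with <code>inplace=True</code> on a column subset of another DataFrame is what produces the SettingWithCopyWarning in the output that follows. A minimal sketch of one way to avoid it, assuming the same column names as above: the two small DataFrames and their values are made up for illustration, and the <code>rating</code> and <code>score</code> columns are only there to stand in for the columns that get dropped.</p>

```python
import pandas as pd

# Made-up rows standing in for the scraped data
apple = pd.DataFrame(
    {"app_name": ["Syfe"], "review": ["Easy to use"], "rating": [5]}
)
google = pd.DataFrame(
    {"app_name": ["Endowus"], "content": ["Clean interface"], "score": [4]}
)

# rename() without inplace returns a new DataFrame, so nothing is
# written back onto a slice of the original
new_apple = apple[["app_name", "review"]].rename(columns={"review": "content"})
new_google = google[["app_name", "content"]]

# ignore_index gives the combined frame a fresh 0..n-1 index
total = pd.concat([new_apple, new_google], ignore_index=True)
print(total)
```

<p>The result has the same two columns as the combined frame above. The only behavioural difference is <code>ignore_index=True</code>, which is optional here because the CSV is written with <code>index=False</code>.</p>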

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/Users/a844133yara.com/.pyenv/versions/3.9.5/envs/python_playground/lib/python3.9/site-packages/pandas/core/frame.py:5034: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().rename(
</code></pre></div></div>]]></content><author><name>Blog Author</name></author><category term="Data Analytics" /><category term="App Reviews" /><category term="Web Scrapping" /><summary type="html"><![CDATA[Scraping app reviews from the Apple and Google Play stores and combining them into one dataset.]]></summary></entry><entry><title type="html">Predicting Housing Prices in Melbourne</title><link href="http://www.layonsan.com/lm-MelbourneHousing/" rel="alternate" type="text/html" title="Predicting Housing Prices in Melbourne" /><published>2021-01-15T00:00:00+00:00</published><updated>2021-01-15T00:00:00+00:00</updated><id>http://www.layonsan.com/lm-MelbourneHousing</id><content type="html" xml:base="http://www.layonsan.com/lm-MelbourneHousing/"><![CDATA[<p>Predicting housing prices in Melbourne through regression analysis. This notebook walks through the full workflow: data cleaning, exploration, and modelling with linear and multiple regression. I also apply feature selection techniques (correlation and mutual information) and evaluate model performance using MAE, MSE, RMSE, and R². This notebook is adapted from <a href="https://www.kaggle.com/anthonypino/price-analysis-and-linear-regression">Price Analysis and Linear Regression</a> on Kaggle.</p>

<h1 id="melbourne-housing-market">Melbourne Housing Market</h1>
<p><a href="https://www.kaggle.com/anthonypino/melbourne-housing-market?select=Melbourne_housing_FULL.csv">Housing clearance data</a> from Jan 2016</p>

<ol>
  <li>When did the Melbourne housing market cool off?</li>
  <li>Could you see it slowing down? Which variables showed the slowdown (overall price, number sold vs unsold, a shift towards more rentals and fewer houses sold, changes in particular CouncilArea or Region, more houses sold further from the Melbourne CBD and fewer closer in)?</li>
  <li>Could you have predicted it?</li>
  <li>Should I hold off even longer on buying a two-bedroom apartment in Northcote?</li>
</ol>

<h3 id="some-key-details">Some Key Details</h3>
<p>Suburb: Suburb
Address: Address
Rooms: Number of rooms
Price: Price in Australian dollars</p>

<h4 id="method">Method:</h4>
<p>S - property sold;
SP - property sold prior;
PI - property passed in;
PN - sold prior not disclosed;
SN - sold not disclosed;
NB - no bid;
VB - vendor bid;
W - withdrawn prior to auction;
SA - sold after auction;
SS - sold after auction price not disclosed.
N/A - price or highest bid not available.</p>

<h4 id="type">Type:</h4>
<p>br - bedroom(s);
h - house,cottage,villa, semi,terrace;
u - unit, duplex;
t - townhouse;
dev site - development site;
o res - other residential.</p>

<p>SellerG: Real Estate Agent</p>

<p>Date: Date sold</p>

<p>Distance: Distance from CBD in Kilometres</p>

<p>Regionname: General region (West, North West, North, North East, etc.)</p>

<p>Propertycount: Number of properties that exist in the suburb.</p>

<p>Bedroom2: Scraped number of bedrooms (from a different source)</p>

<p>Bathroom: Number of Bathrooms</p>

<p>Car: Number of carspots</p>

<p>Landsize: Land size in square metres</p>

<p>BuildingArea: Building size in square metres</p>

<p>YearBuilt: Year the house was built</p>

<p>CouncilArea: Governing council for the area</p>

<p>Lattitude: Self-explanatory</p>

<p>Longtitude: Self-explanatory</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Import libraries
</span>
<span class="c1"># Data wrangling
</span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">date</span> <span class="c1"># Usage: Determine days from start
</span>
<span class="c1"># Data Visualisations
</span><span class="o">%</span><span class="n">matplotlib</span> <span class="n">inline</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">pylab</span> <span class="k">as</span> <span class="n">pl</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>

<span class="c1"># Model Development and Evaluation
</span><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span> <span class="c1"># For Model Development
</span><span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="kn">from</span> <span class="nn">sklearn</span> <span class="kn">import</span> <span class="n">metrics</span>

</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Reading source files
</span>
<span class="c1"># df_houseprice = pd.read_csv("data/MELBOURNE_HOUSE_PRICES_LESS.csv")
</span><span class="n">df_housingfull</span><span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"data/Melbourne_housing_FULL.csv"</span><span class="p">)</span>
</code></pre></div></div>

<h2 id="linear-regression">Linear Regression</h2>

<h3 id="data-cleaning">Data Cleaning</h3>

<ol>
  <li>Convert the values in the Date column to datetime</li>
  <li>Filter out rows that are not of the house type</li>
</ol>

<p>I will focus only on properties of type h (house, cottage, villa, semi, terrace).</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Data Cleaning
</span><span class="n">df_housingfull</span> <span class="o">=</span> <span class="n">df_housingfull</span><span class="p">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s">'Lattitude'</span><span class="p">:</span><span class="s">'Latitude'</span><span class="p">})</span> <span class="c1"># Rename column names
</span>
<span class="c1"># Remove irrelevant columns
</span><span class="n">df_housingfull</span> <span class="o">=</span> <span class="n">df_housingfull</span><span class="p">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Suburb'</span><span class="p">,</span> <span class="s">'Address'</span><span class="p">,</span> <span class="s">'SellerG'</span><span class="p">,</span><span class="s">'Regionname'</span><span class="p">,</span> <span class="s">'CouncilArea'</span><span class="p">],</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># Convert date column to datetime
</span><span class="n">df_housingfull</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df_housingfull</span><span class="p">[</span><span class="s">'Date'</span><span class="p">],</span><span class="n">dayfirst</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"There are {} rows and {} columns in this dataframe"</span> <span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">df_housingfull</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="n">df_housingfull</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>

<span class="c1"># Create new dataframe with only housing data
</span><span class="n">df</span> <span class="o">=</span> <span class="n">df_housingfull</span><span class="p">[</span><span class="n">df_housingfull</span><span class="p">[</span><span class="s">'Type'</span><span class="p">]</span><span class="o">==</span><span class="s">'h'</span><span class="p">]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"After filtering data that are not housing types, there are {} rows and {} columns in this new dataframe"</span> <span class="p">.</span><span class="nb">format</span><span class="p">(</span><span class="n">df</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="n">df</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>There are 34857 rows and 16 columns in this dataframe
After filtering data that are not housing types, there are 23980 rows and 16 columns in this new dataframe
</code></pre></div></div>
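
<p>The two cleaning steps can be checked on a few rows before running them on the full file. The sketch below uses a made-up four-row sample (the values are illustrative, not taken from the dataset) and shows why <code>dayfirst=True</code> matters: a string such as 03/09/2016 is otherwise read month-first.</p>

```python
import pandas as pd

# Made-up sample rows with day-first date strings
sample = pd.DataFrame({
    "Type": ["h", "u", "h", "t"],
    "Date": ["03/09/2016", "04/02/2016", "17/09/2016", "28/01/2017"],
    "Price": [1480000.0, 650000.0, 1035000.0, 900000.0],
})

# Step 1: parse the dates as day/month/year
sample["Date"] = pd.to_datetime(sample["Date"], dayfirst=True)

# Step 2: keep only rows of type 'h'; copy() makes the result independent
houses = sample[sample["Type"] == "h"].copy()
print(houses)

# For comparison, the default parse of a single ambiguous string is month-first
print(pd.to_datetime("03/09/2016"))
```

<p>Taking a <code>copy()</code> of the filtered rows is a design choice rather than a requirement: it makes later column assignments on the subset unambiguous.</p>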

<h3 id="data-exploration-using-visualisations">Data Exploration using Visualisations</h3>

<ol>
  <li>Histogram plot for each variable</li>
  <li>Pair plots</li>
  <li>Observe average price change per quarter over the years</li>
</ol>
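
<p>For the third item, one way to get the average price per quarter is to group on the quarter of the sale date. This is a sketch under assumptions: the five rows are made up for illustration, and grouping with <code>dt.to_period("Q")</code> is one option among several, not necessarily the approach taken in the plots that follow.</p>

```python
import pandas as pd

# Made-up sales spread over three quarters of 2016
sales = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2016-01-15", "2016-02-20", "2016-05-10", "2016-08-01", "2016-09-30"]
    ),
    "Price": [800000.0, 1000000.0, 950000.0, 1100000.0, 900000.0],
})

# Average price per calendar quarter
quarterly = sales.groupby(sales["Date"].dt.to_period("Q"))["Price"].mean()

# Quarter-on-quarter relative change (first quarter has no predecessor)
change = quarterly.pct_change()

print(quarterly)
print(change)
```

<p>Grouping by quarter smooths out the week-to-week noise of individual auction dates, which makes a cooling trend easier to see than a scatter of daily averages.</p>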

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Plot Relationships between price and features
</span><span class="n">sns</span><span class="p">.</span><span class="n">set_style</span><span class="p">(</span> <span class="s">'darkgrid'</span> <span class="p">)</span>
<span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="n">figsize</span><span class="o">=</span><span class="p">[</span><span class="mi">20</span><span class="p">,</span><span class="mi">20</span><span class="p">])</span>

<span class="c1"># Plot 1: Scatterplot of Average Price against Date
</span><span class="n">mean_df</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'Date'</span><span class="p">,</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">).</span><span class="n">groupby</span><span class="p">(</span><span class="s">'Date'</span><span class="p">).</span><span class="n">mean</span><span class="p">(</span><span class="n">numeric_only</span><span class="o">=</span><span class="bp">True</span><span class="p">).</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">scatter</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'Date'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'Price'</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">mean_df</span><span class="p">,</span><span class="n">edgecolor</span><span class="o">=</span><span class="s">'b'</span> <span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span> <span class="s">'Date'</span> <span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span> <span class="s">'Price'</span> <span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span> <span class="s">'Price vs Date'</span><span class="p">)</span>

<span class="c1"># Plot 2: Diagonal Correlation Matrix 
# Compute the correlation matrix
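</span><span class="n">corr</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">corr</span><span class="p">(</span><span class="n">numeric_only</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>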
<span class="c1"># Generate a mask for the upper triangle
</span><span class="n">mask</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">triu</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">ones_like</span><span class="p">(</span><span class="n">corr</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">bool</span><span class="p">))</span>
<span class="c1"># Generate a custom diverging colormap
</span><span class="n">cmap</span> <span class="o">=</span> <span class="n">sns</span><span class="p">.</span><span class="n">diverging_palette</span><span class="p">(</span><span class="mi">230</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="n">as_cmap</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># Draw the heatmap with the mask and correct aspect ratio
</span><span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">corr</span><span class="p">,</span> <span class="n">mask</span><span class="o">=</span><span class="n">mask</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">cmap</span><span class="p">,</span> <span class="n">center</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
            <span class="n">square</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">linewidths</span><span class="o">=</span><span class="p">.</span><span class="mi">5</span><span class="p">,</span> <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"shrink"</span><span class="p">:</span> <span class="p">.</span><span class="mi">5</span><span class="p">},</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Date'</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Property Count per Suburb'</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Property Count vs Date'</span><span class="p">)</span>

<span class="c1"># Plot 3: Boxplot of Price against number of Bathrooms
</span><span class="n">sns</span><span class="p">.</span><span class="n">boxplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'Bathroom'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'Price'</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">df</span> <span class="p">,</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">]</span> <span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span> <span class="s">'Bathroom'</span> <span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span> <span class="s">'Price'</span> <span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span> <span class="s">'Price vs Bathroom'</span><span class="p">)</span>

<span class="c1"># Plot 4: Boxplot of Price against number of Bedrooms
</span><span class="n">sns</span><span class="p">.</span><span class="n">boxplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'Bedroom2'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'Price'</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">df</span> <span class="p">,</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">]</span> <span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span> <span class="s">'Bedroom'</span> <span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span> <span class="s">'Price'</span> <span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span> <span class="s">'Price vs Bedroom'</span><span class="p">)</span>

<span class="c1"># Plot 5: Regression plot of Average Distance against Average Price
</span><span class="n">sns</span><span class="p">.</span><span class="n">regplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'Distance'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'Price'</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">mean_df</span><span class="p">,</span><span class="n">scatter_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"color"</span><span class="p">:</span> <span class="s">"black"</span><span class="p">},</span> <span class="n">line_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"color"</span><span class="p">:</span> <span class="s">"red"</span><span class="p">},</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">])</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Distance'</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Price'</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">0</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Price vs Distance'</span><span class="p">)</span>

<span class="c1"># Plot 6: Regression plot of Distance against Price
</span><span class="n">sns</span><span class="p">.</span><span class="n">regplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="s">'Distance'</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="s">'Price'</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span><span class="n">scatter_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"color"</span><span class="p">:</span> <span class="s">"black"</span><span class="p">},</span> <span class="n">line_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"color"</span><span class="p">:</span> <span class="s">"red"</span><span class="p">},</span><span class="n">ax</span><span class="o">=</span><span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">])</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Distance'</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Price'</span><span class="p">)</span>
<span class="n">axes</span><span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">].</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Price vs Distance'</span><span class="p">)</span>

</code></pre></div></div>

<p><img src="/assets/images/regression_housing_prices/01_LE_MelbourneHousing_9_1.png" alt="eda plot" /></p>

<p>These visualisations can help to answer the first 2 questions:</p>

<ol>
  <li>Housing prices in Melbourne appear to begin cooling off sometime between April and July 2017.</li>
  <li>Based on the correlation matrix, the features that most affect pricing are the number of bathrooms, the number of bedrooms and the distance (in kilometres) from the CBD. I plotted boxplots to visualise how price varies with the number of bedrooms and bathrooms. The boxplot for the number of bedrooms indicates that there’s quite a lot of variability. For distance, I used a regression plot to see how price varies. The plot shows a negative relationship between the two, which is logical since housing near the CBD is usually priced higher than housing in the outer regions.</li>
</ol>
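
<p>As a rough sketch of how the correlation ranking in point 2 could be read off in code rather than from the heatmap, the snippet below ranks columns by the absolute value of their correlation with <code>Price</code>. The <code>toy</code> DataFrame and its values are made up for illustration and are not the Melbourne data.</p>

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing DataFrame (values are invented for illustration)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Bathroom": rng.integers(1, 4, n),
    "Distance": rng.uniform(1, 40, n),
    "Landsize": rng.uniform(100, 900, n),
})
toy["Price"] = (
    300_000 * toy["Bathroom"] - 15_000 * toy["Distance"] + rng.normal(0, 50_000, n)
)

# Correlation of every column with Price, excluding Price itself
corr_with_price = toy.corr()["Price"].drop("Price")

# Rank by absolute strength so strong negative correlations are not pushed to the bottom
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)
print(ranked)
```

<p>Sorting on the absolute value matters here because a feature such as distance is negatively correlated with price and would otherwise appear last.</p>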

<h3 id="linear-regression-model-with-all-features">Linear Regression Model with all Features</h3>

<p>In this part, I evaluate a linear regression model using all the available features. The data is split into training and test sets with a 2:1 ratio. The coefficients of the predictor variables are then ranked in descending order; in the output below, longitude, number of bathrooms and number of rooms have the three largest positive coefficients.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">## Further data cleanup
# Remove missing values
</span><span class="n">df1</span> <span class="o">=</span> <span class="n">df</span><span class="p">.</span><span class="n">dropna</span><span class="p">().</span><span class="n">sort_values</span><span class="p">(</span><span class="s">'Date'</span><span class="p">)</span>

<span class="c1">###########
##Find out days since start
</span><span class="n">days_since_start</span> <span class="o">=</span> <span class="p">[(</span><span class="n">x</span><span class="o">-</span><span class="n">df1</span><span class="p">[</span><span class="s">'Date'</span><span class="p">].</span><span class="nb">min</span><span class="p">()).</span><span class="n">days</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">df1</span><span class="p">[</span><span class="s">'Date'</span><span class="p">]]</span>
<span class="n">df1</span><span class="p">[</span><span class="s">'Days'</span><span class="p">]</span> <span class="o">=</span> <span class="n">days_since_start</span>

<span class="c1"># Convert Categorical Variables to dummy/indicator variables
</span><span class="n">df2_dummies</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">get_dummies</span><span class="p">(</span><span class="n">df1</span><span class="p">[[</span><span class="s">'Type'</span><span class="p">,</span><span class="s">'Method'</span><span class="p">]])</span>
<span class="n">df2</span> <span class="o">=</span> <span class="n">df1</span><span class="p">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Type'</span><span class="p">,</span><span class="s">'Date'</span><span class="p">,</span><span class="s">'Method'</span><span class="p">],</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">).</span><span class="n">join</span><span class="p">(</span><span class="n">df2_dummies</span><span class="p">)</span>

<span class="c1"># Determine x (independent variables or predictor variables) and y (dependent variables) 
</span><span class="n">y</span> <span class="o">=</span> <span class="n">df2</span><span class="p">[</span><span class="s">'Price'</span><span class="p">]</span> <span class="c1"># Price being the dependent variable
</span><span class="n">x</span> <span class="o">=</span> <span class="n">df2</span><span class="p">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Price'</span><span class="p">],</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># Remove price from the independent variables
</span>
<span class="c1"># Split into training and test set
</span><span class="n">x_train</span><span class="p">,</span> <span class="n">x_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">,</span><span class="n">test_size</span><span class="o">=</span><span class="mf">0.33</span><span class="p">)</span>

<span class="c1"># Fit the model
</span><span class="n">model</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">model</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>

<span class="c1"># Evaluate the model
</span><span class="n">ypredictions</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x_test</span><span class="p">)</span>

<span class="c1"># Ranking the coefficients
</span><span class="n">coeff_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">coef_</span><span class="p">,</span><span class="n">x</span><span class="p">.</span><span class="n">columns</span><span class="p">,</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s">'Coefficient'</span><span class="p">])</span>
<span class="n">ranked_coeff</span> <span class="o">=</span> <span class="n">coeff_df</span><span class="p">.</span><span class="n">sort_values</span><span class="p">(</span><span class="s">"Coefficient"</span><span class="p">,</span> <span class="n">ascending</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">ranked_coeff</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                Coefficient
Longtitude     5.404310e+05
Bathroom       2.050054e+05
Rooms          8.073899e+04
Car            5.194845e+04
Method_VB      4.878232e+04
Method_S       3.758929e+04
Bedroom2       3.111879e+04
BuildingArea   1.683848e+03
Method_PI      1.355322e+03
Postcode       1.044804e+03
Days           1.491648e+02
Landsize       6.700705e+01
Propertycount  1.272204e+00
Type_h         1.164153e-10
YearBuilt     -3.213159e+03
Method_SP     -3.698939e+04
Method_SA     -5.073755e+04
Distance      -5.161010e+04
Latitude      -1.537221e+06
</code></pre></div></div>
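
<p>Raw coefficients are expressed per unit of each feature, so their ranking depends on the scale of the features. The sketch below, on made-up data, compares raw coefficients with coefficients fitted on standardised features. The column names are borrowed from the dataset, but the values and the use of <code>StandardScaler</code> are assumptions for illustration and not part of the original analysis.</p>

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: one feature with a tiny spread, one with a large spread
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "Longtitude": rng.normal(145.0, 0.1, n),
    "Landsize": rng.normal(500.0, 200.0, n),
})
y = 400_000 * (X["Longtitude"] - 145.0) + 300 * X["Landsize"] + rng.normal(0, 10_000, n)

# Coefficients per original unit of each feature
raw = LinearRegression().fit(X, y)
# Coefficients per standard deviation of each feature
scaled = LinearRegression().fit(StandardScaler().fit_transform(X), y)

comparison = pd.DataFrame(
    {"raw": raw.coef_, "standardised": scaled.coef_}, index=X.columns
)
print(comparison)
```

<p>In this toy setup the feature with the largest raw coefficient is not the one with the largest standardised coefficient, which is why a raw ranking is best read as "effect per unit" rather than as overall importance.</p>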

<h4 id="scatter-plot-of-actual-vs-predicted">Scatter Plot of Actual vs Predicted</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig_lm</span><span class="p">,</span><span class="n">axes_lm</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="n">figsize</span><span class="o">=</span><span class="p">[</span><span class="mi">15</span><span class="p">,</span><span class="mi">10</span><span class="p">])</span> <span class="c1"># Create a custom size figure
</span>
<span class="c1"># # ax1 = fig_lm.add_subplot() # Add subplot
</span><span class="n">sns</span><span class="p">.</span><span class="n">regplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">ypredictions</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="n">y_test</span><span class="p">,</span><span class="n">line_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"color"</span><span class="p">:</span><span class="s">"red"</span><span class="p">},</span><span class="n">ax</span><span class="o">=</span><span class="n">axes_lm</span><span class="p">)</span>
<span class="n">axes_lm</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Predicted"</span><span class="p">)</span> <span class="c1"># Add x label
</span><span class="n">axes_lm</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Observed"</span><span class="p">)</span> <span class="c1"># Add y label
</span><span class="n">axes_lm</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Observed vs Predicted"</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/images/regression_housing_prices/01_LE_MelbourneHousing_14_1.png" alt="observed vs predicted" /></p>

<h4 id="distribution-plot-difference-in-actual-price-and-predicted-price">Distribution plot: difference in actual price and predicted price</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sns</span><span class="p">.</span><span class="n">displot</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="p">(</span><span class="n">y_test</span><span class="o">-</span><span class="n">ypredictions</span><span class="p">),</span><span class="n">bins</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>

</code></pre></div></div>
<p><img src="/assets/images/regression_housing_prices/01_LE_MelbourneHousing_16_1.png" alt="distribution plot" /></p>

<h4 id="evaluating-the-raw-linear-regression-model">Evaluating the Raw Linear Regression model</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"------Evaluated predictions for a raw Linear Regression Model------"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MAE: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">mean_absolute_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MSE: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"RMSE: "</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">metrics</span><span class="p">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"R^2 "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">r2_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------Evaluated predictions for a raw Linear Regression Model------
MAE:  303135.5223904289
MSE:  211634647505.68866
RMSE:  460037.6587907654
R^2  0.5857898755940139
</code></pre></div></div>
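
<p>For reference, the four metrics above can be written out directly with NumPy. This is a minimal sketch on a handful of invented numbers, intended only to show what each metric measures.</p>

```python
import numpy as np

# Invented observed and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))      # average size of the error
mse = np.mean(errors ** 2)         # average squared error, penalises large misses
rmse = np.sqrt(mse)                # back in the same units as the target

ss_res = np.sum(errors ** 2)                     # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R^2:", r2)
```

<p>Because RMSE is in the same units as the target, it is the easiest of the three error metrics to compare against typical house prices.</p>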

<h2 id="multiple-regression">Multiple Regression</h2>

<h3 id="feature-selection">Feature Selection</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">RepeatedKFold</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_selection</span> <span class="kn">import</span> <span class="n">f_regression</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_selection</span> <span class="kn">import</span> <span class="n">SelectKBest</span>
<span class="kn">from</span> <span class="nn">sklearn.feature_selection</span> <span class="kn">import</span> <span class="n">mutual_info_regression</span>
<span class="kn">from</span> <span class="nn">sklearn.pipeline</span> <span class="kn">import</span> <span class="n">Pipeline</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">mean</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">std</span>
<span class="kn">from</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="n">pyplot</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">GridSearchCV</span>
</code></pre></div></div>

<h4 id="mutual-information-statistics">Mutual Information Statistics</h4>

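<p>This model uses mutual information to determine which variables are most relevant. Mutual information measures how much knowing a feature reduces uncertainty about the target, so unlike Pearson's correlation it can also pick up non-linear relationships.</p>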

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a function that can implement feature selection for the input training and test data
</span><span class="k">def</span> <span class="nf">select_features_mis</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">Y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">):</span>
    <span class="c1"># Configure to select the 16 highest-scoring features
</span>    <span class="n">features</span> <span class="o">=</span> <span class="n">SelectKBest</span><span class="p">(</span><span class="n">score_func</span><span class="o">=</span><span class="n">mutual_info_regression</span><span class="p">,</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">16</span><span class="p">)</span>
    <span class="c1"># Learn relationship from training data
</span>    <span class="n">features</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">Y_train</span><span class="p">)</span>
    <span class="c1"># Transform training data
</span>    <span class="n">X_train_feats</span> <span class="o">=</span> <span class="n">features</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
    <span class="c1"># Transform test data
</span>    <span class="n">X_test_feats</span> <span class="o">=</span> <span class="n">features</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">X_train_feats</span><span class="p">,</span><span class="n">X_test_feats</span><span class="p">,</span><span class="n">features</span>

<span class="c1"># Running the regression model that applies feature selection (mutual information statistics)
# Feature selection
</span><span class="n">x_train_feats_mis</span><span class="p">,</span> <span class="n">x_test_feats_mis</span><span class="p">,</span> <span class="n">features_mis</span> <span class="o">=</span> <span class="n">select_features_mis</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">,</span><span class="n">x_test</span><span class="p">)</span>

<span class="c1"># Scores for the features
</span><span class="k">for</span> <span class="n">feature</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">features_mis</span><span class="p">.</span><span class="n">scores_</span><span class="p">)):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Feature %d: %f'</span> <span class="o">%</span> <span class="p">(</span><span class="n">feature</span><span class="p">,</span> <span class="n">features_mis</span><span class="p">.</span><span class="n">scores_</span><span class="p">[</span><span class="n">feature</span><span class="p">]))</span>

<span class="c1"># Fit the model
</span><span class="n">model_feats_mis</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">model_feats_mis</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x_train_feats_mis</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>

<span class="c1"># Evaluate the model
</span><span class="n">ypredictions_feats_mis</span> <span class="o">=</span> <span class="n">model_feats_mis</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x_test_feats_mis</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Feature 0: 0.085276
Feature 1: 0.379063
Feature 2: 0.535257
Feature 3: 0.083888
Feature 4: 0.111936
Feature 5: 0.028387
Feature 6: 0.061553
Feature 7: 0.143656
Feature 8: 0.147641
Feature 9: 0.300453
Feature 10: 0.259006
Feature 11: 0.328668
Feature 12: 0.039872
Feature 13: 0.011723
Feature 14: 0.014065
Feature 15: 0.040096
Feature 16: 0.000000
Feature 17: 0.005040
Feature 18: 0.056839
</code></pre></div></div>
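
<p>The scores above are indexed by position, which makes them hard to read. One possible way to attach column names and see which features were kept is sketched below on made-up data. The <code>random_state</code> passed through <code>functools.partial</code> is an addition for reproducibility and does not appear in the original code.</p>

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data: two informative columns and one pure-noise column
rng = np.random.default_rng(2)
n = 500
X = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n).astype(float),
    "Distance": rng.uniform(1, 40, n),
    "Noise": rng.normal(size=n),
})
y = 200_000 * X["Rooms"] - 20_000 * X["Distance"] + rng.normal(0, 20_000, n)

# Fix the estimator's random_state so the scores are repeatable
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func=score_func, k=2).fit(X, y)

# Label each score with its column name and sort from most to least informative
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
# get_support() returns a boolean mask of the columns that were kept
kept = list(X.columns[selector.get_support()])

print(scores)
print("Kept:", kept)
```

<p>The same pattern should work with the post's <code>features_mis</code> and <code>x_train.columns</code>, since <code>scores_</code> follows the column order of the training data.</p>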

<h4 id="correlation-statistics">Correlation Statistics</h4>

<p>This model uses correlation (most commonly Pearson's correlation) between each feature and the target to determine which variables are most relevant.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Create a function that can implement feature selection for the input training and test data
</span><span class="k">def</span> <span class="nf">select_features_cs</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">Y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">):</span>
    <span class="c1"># Configure to select the 16 highest-scoring features
</span>    <span class="n">features</span> <span class="o">=</span> <span class="n">SelectKBest</span><span class="p">(</span><span class="n">score_func</span><span class="o">=</span><span class="n">f_regression</span><span class="p">,</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">16</span><span class="p">)</span>
    <span class="c1"># Learn relationship from training data
</span>    <span class="n">features</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span><span class="n">Y_train</span><span class="p">)</span>
    <span class="c1"># Transform training data
</span>    <span class="n">X_train_feats</span> <span class="o">=</span> <span class="n">features</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_train</span><span class="p">)</span>
    <span class="c1"># Transform test data
</span>    <span class="n">X_test_feats</span> <span class="o">=</span> <span class="n">features</span><span class="p">.</span><span class="n">transform</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">X_train_feats</span><span class="p">,</span><span class="n">X_test_feats</span><span class="p">,</span><span class="n">features</span>

<span class="c1"># Running the regression model that applies feature selection (correlation statistics)
# Feature selection
</span><span class="n">x_train_feats_cs</span><span class="p">,</span> <span class="n">x_test_feats_cs</span><span class="p">,</span> <span class="n">features_cs</span> <span class="o">=</span> <span class="n">select_features_cs</span><span class="p">(</span><span class="n">x_train</span><span class="p">,</span><span class="n">y_train</span><span class="p">,</span><span class="n">x_test</span><span class="p">)</span>

<span class="c1"># Scores for the features
</span><span class="k">for</span> <span class="n">feature</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">features_cs</span><span class="p">.</span><span class="n">scores_</span><span class="p">)):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Feature %d: %f'</span> <span class="o">%</span> <span class="p">(</span><span class="n">feature</span><span class="p">,</span> <span class="n">features_cs</span><span class="p">.</span><span class="n">scores_</span><span class="p">[</span><span class="n">feature</span><span class="p">]))</span>

<span class="c1"># Create model
</span><span class="n">model_feats_cs</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="c1"># Fit the model
</span><span class="n">model_feats_cs</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x_train_feats_cs</span><span class="p">,</span><span class="n">y_train</span><span class="p">)</span>
<span class="c1"># Evaluate the model
</span><span class="n">ypredictions_feats_cs</span> <span class="o">=</span> <span class="n">model_feats_cs</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x_test_feats_cs</span><span class="p">)</span>
</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Feature 0: 624.218533
Feature 1: 831.558520
Feature 2: 1.104028
Feature 3: 559.223771
Feature 4: 1039.871719
Feature 5: 52.724016
Feature 6: 6.962535
Feature 7: 918.872851
Feature 8: 354.376184
Feature 9: 356.552179
Feature 10: 228.442479
Feature 11: 11.192402
Feature 12: 75.345978
Feature 13: nan
Feature 14: 18.467231
Feature 15: 15.835652
Feature 16: 0.421705
Feature 17: 54.838265
Feature 18: 118.419765
</code></pre></div></div>
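
<p>The raw scores are easier to compare once they are ranked. The sketch below is an illustration and not part of the original notebook: it copies the scores printed above into a hypothetical <code>scores</code> list, sets aside the <code>nan</code> entry (which cannot be ordered), and prints the five highest-scoring features.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

# Scores copied from the output above; the list position is the feature index.
scores = [624.218533, 831.558520, 1.104028, 559.223771, 1039.871719,
          52.724016, 6.962535, 918.872851, 354.376184, 356.552179,
          228.442479, 11.192402, 75.345978, float("nan"), 18.467231,
          15.835652, 0.421705, 54.838265, 118.419765]

# A nan score cannot be ordered, so those features are kept in a separate list.
undefined = [idx for idx, score in enumerate(scores) if math.isnan(score)]

# Rank the remaining features from highest to lowest score.
ranked = sorted(
    ((idx, score) for idx, score in enumerate(scores) if not math.isnan(score)),
    key=lambda pair: pair[1],
    reverse=True,
)

for idx, score in ranked[:5]:
    print("Feature %d: %f" % (idx, score))
print("Features with undefined scores:", undefined)
</code></pre></div></div>

<p>With these values, features 4, 7, 1, 0 and 3 rank highest, while feature 13 is reported separately because its score is undefined.</p>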

<h4 id="visualising-regression-models">Visualising Regression Models</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fig_lm</span><span class="p">,(</span><span class="n">axes_lm_mis</span><span class="p">,</span><span class="n">axes_lm_cs</span><span class="p">)</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="n">figsize</span><span class="o">=</span><span class="p">[</span><span class="mi">15</span><span class="p">,</span><span class="mi">10</span><span class="p">])</span> <span class="c1"># Create a custom size figure
</span>
<span class="c1"># Creating plot for Mutual Information Statistics
</span><span class="n">sns</span><span class="p">.</span><span class="n">regplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">ypredictions_feats_mis</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="n">y_test</span><span class="p">,</span><span class="n">line_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"color"</span><span class="p">:</span><span class="s">"red"</span><span class="p">},</span><span class="n">ax</span><span class="o">=</span><span class="n">axes_lm_mis</span><span class="p">)</span>
<span class="n">axes_lm_mis</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Predicted"</span><span class="p">)</span> <span class="c1"># Add x label
</span><span class="n">axes_lm_mis</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Observed"</span><span class="p">)</span> <span class="c1"># Add y label
</span><span class="n">axes_lm_mis</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Linear Regression: Mutual Information Statistics for Observed vs Predicted"</span><span class="p">)</span>

<span class="c1"># Creating plot for Correlation Statistics
</span><span class="n">sns</span><span class="p">.</span><span class="n">regplot</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">ypredictions_feats_cs</span><span class="p">,</span><span class="n">y</span><span class="o">=</span><span class="n">y_test</span><span class="p">,</span><span class="n">line_kws</span><span class="o">=</span><span class="p">{</span><span class="s">"color"</span><span class="p">:</span><span class="s">"red"</span><span class="p">},</span><span class="n">ax</span><span class="o">=</span><span class="n">axes_lm_cs</span><span class="p">)</span>
<span class="n">axes_lm_cs</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">"Predicted"</span><span class="p">)</span> <span class="c1"># Add x label
</span><span class="n">axes_lm_cs</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">"Observed"</span><span class="p">)</span> <span class="c1"># Add y label
</span><span class="n">axes_lm_cs</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"Linear Regression: Correlation Statistics for Observed vs Predicted"</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/assets/images/regression_housing_prices/01_LE_MelbourneHousing_2_23_1.png" alt="Comparing Observed vs Predicted" /></p>

<h4 id="model-evaluation">Model Evaluation</h4>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">print</span><span class="p">(</span><span class="s">"------Evaluated predictions for a raw Linear Regression Model------"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MAE: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">mean_absolute_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MSE: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"RMSE: "</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">metrics</span><span class="p">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"R^2: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">r2_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions</span><span class="p">))</span>

<span class="k">print</span><span class="p">(</span><span class="s">"------Evaluated predictions for a Linear Regression Model with Correlation Statistics Feature Selection------"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MAE: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">mean_absolute_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions_feats_cs</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MSE: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions_feats_cs</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"RMSE: "</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">metrics</span><span class="p">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions_feats_cs</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"R^2: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">r2_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions_feats_cs</span><span class="p">))</span>

<span class="k">print</span><span class="p">(</span><span class="s">"------Evaluated predictions for a Linear Regression Model with Mutual Information Statistics Feature Selection------"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MAE: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">mean_absolute_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions_feats_mis</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MSE: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions_feats_mis</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"RMSE: "</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">metrics</span><span class="p">.</span><span class="n">mean_squared_error</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions_feats_mis</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"R^2: "</span><span class="p">,</span> <span class="n">metrics</span><span class="p">.</span><span class="n">r2_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span><span class="n">ypredictions_feats_mis</span><span class="p">))</span>

</code></pre></div></div>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>------Evaluated predictions for a raw Linear Regression Model------
MAE:  301077.07816341706
MSE:  235171170575.45062
RMSE:  484944.50257266616
R^2:  0.5415862801118618
------Evaluated predictions for a Linear Regression Model with Correlation Statistics Feature Selection------
MAE:  309316.00912912446
MSE:  246586130384.74255
RMSE:  496574.39561937
R^2:  0.5193353631489246
------Evaluated predictions for a Linear Regression Model with Mutual Information Statistics Feature Selection------
MAE:  301072.41752915125
MSE:  235167756102.42468
RMSE:  484940.9820817629
R^2:  0.541592935865105
</code></pre></div></div>
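
<p>For reference, the four metrics reported above follow standard definitions. The sketch below is a minimal illustration that uses only the standard library instead of <code>sklearn.metrics</code>; the function name <code>regression_metrics</code> and the toy values are assumptions made for this example and do not come from the housing data.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R^2": r2}

# Toy values, invented for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
for name, value in regression_metrics(y_true, y_pred).items():
    print(name + ": ", value)
</code></pre></div></div>

<p>A helper of this kind could also replace the three repeated groups of <code>print</code> calls, since each group differs only in the predictions passed in.</p>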

<p>Comparing the two feature selection techniques, the metrics indicate that mutual information statistics give the more accurate model, with a higher R^2 (0.5416 vs 0.5193) and lower error metrics (MAE, MSE and RMSE) than correlation statistics. It also edges out the raw linear regression model, though only marginally (R^2 of 0.541593 vs 0.541586).</p>]]></content><author><name>Blog Author</name></author><category term="Data Analytics" /><category term="Linear Regression" /><summary type="html"><![CDATA[Using linear regression to predict housing prices in Melbourne.]]></summary></entry></feed>