Deep Learning Models for NLP

Deep Averaging Network and GRU: Developed a Deep Averaging Network (DAN) and a Gated Recurrent Unit (GRU) model to perform sentiment analysis, and evaluated the information learned at each layer using linear probes (linear classifiers) trained on each layer's output.
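
As a rough illustration, below is a minimal TensorFlow 2.x sketch of a DAN with a linear probe attached to one of its hidden layers; the layer sizes, layer names, and two-class setup are assumptions, not the exact configuration used.

```python
# Minimal DAN sketch with a linear probe on an intermediate layer (assumed sizes/names).
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, NUM_CLASSES = 10_000, 128, 2   # assumed hyperparameters

def build_dan():
    tokens = tf.keras.Input(shape=(None,), dtype=tf.int32)
    emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(tokens)
    avg = tf.keras.layers.GlobalAveragePooling1D()(emb)            # average the word vectors
    h1 = tf.keras.layers.Dense(128, activation="relu", name="dan_layer_1")(avg)
    h2 = tf.keras.layers.Dense(128, activation="relu", name="dan_layer_2")(h1)
    logits = tf.keras.layers.Dense(NUM_CLASSES)(h2)
    return tf.keras.Model(tokens, logits)

dan = build_dan()
# ... train `dan` on the sentiment task, then freeze it ...
dan.trainable = False

# Linear probe: a single softmax layer trained on the frozen output of one layer,
# measuring how much label-relevant information that layer has learned.
probe_features = tf.keras.Model(dan.input, dan.get_layer("dan_layer_1").output)
probe = tf.keras.Sequential([
    probe_features,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
probe.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```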

Neural-network based transition parsing: Developed a neural-network based transition parser (arc-standard algorithm) with a custom cubic activation function, and compared the performance of the cubic activation against tanh and sigmoid activations. Reference papers used: Incrementality in Deterministic Dependency Parsing and A Fast and Accurate Dependency Parser using Neural Networks
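
The cubic activation from the Chen & Manning parser is easy to express as a custom Keras activation. The sketch below is illustrative only: the vocabulary size, feature count, and layer widths are assumptions, and replacing cubic with "tanh" or "sigmoid" in the hidden layer gives the baselines for the activation comparison.

```python
import tensorflow as tf

def cubic(x):
    return tf.pow(x, 3)  # g(x) = x^3, applied element-wise

# Assumed sizes: 48 parser-state features, 50-d embeddings, 200 hidden units,
# 3 transition types (SHIFT, LEFT-ARC, RIGHT-ARC) for unlabeled arc-standard parsing.
VOCAB, NUM_FEATURES, EMBED_DIM, HIDDEN, NUM_TRANSITIONS = 20_000, 48, 50, 200, 3

parser = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, EMBED_DIM, input_length=NUM_FEATURES),
    tf.keras.layers.Flatten(),                        # concatenate the feature embeddings
    tf.keras.layers.Dense(HIDDEN, activation=cubic),  # cubic hidden layer
    tf.keras.layers.Dense(NUM_TRANSITIONS),           # scores over the parser transitions
])
```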

Bi-Directional GRU with Attention: Developed a model using a bi-directional GRU with attention to perform relation extraction on the SemEval-2010 Task 8 dataset. Reference papers used: SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals and Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification
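
A minimal sketch of the bi-directional GRU with word-level attention is shown below. The dimensions are assumptions, and the attention scoring is a simplification of the Att-BLSTM formulation (a single tanh projection per timestep, a softmax over time, then an attention-weighted sum of the GRU outputs).

```python
import tensorflow as tf

# Assumed dimensions; SemEval-2010 Task 8 has 19 relation classes
# (9 directed relations plus "Other").
VOCAB, EMBED_DIM, UNITS, MAX_LEN, NUM_RELATIONS = 20_000, 100, 128, 90, 19

tokens = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32)
emb = tf.keras.layers.Embedding(VOCAB, EMBED_DIM)(tokens)
H = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(UNITS, return_sequences=True))(emb)       # (batch, T, 2*UNITS)

# Simplified word-level attention: score each timestep, softmax over time,
# then take the attention-weighted sum of the GRU outputs.
scores = tf.keras.layers.Dense(1, activation="tanh")(H)           # (batch, T, 1)
alpha = tf.keras.layers.Softmax(axis=1)(scores)                   # attention weights
context = tf.keras.layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([alpha, H])     # (batch, 2*UNITS)

outputs = tf.keras.layers.Dense(NUM_RELATIONS, activation="softmax")(context)
model = tf.keras.Model(tokens, outputs)
```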

Technologies Used: TensorFlow 2.0


Word2Vec - Skipgram and Bias Evaluation (WEAT)

Implemented the Word2Vec skip-gram model using Cross Entropy (CE) and Noise Contrastive Estimation (NCE) loss functions, and evaluated bias in the learned embeddings using the Word Embedding Association Test (WEAT).
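
A minimal sketch of the NCE variant in TensorFlow 2.x is shown below; the vocabulary size, embedding dimension, and negative-sample count are assumptions. The CE variant replaces tf.nn.nce_loss with a full softmax cross-entropy over the vocabulary.

```python
import tensorflow as tf

VOCAB_SIZE, EMBED_DIM, NUM_SAMPLED = 50_000, 300, 64   # assumed hyperparameters

embeddings = tf.Variable(tf.random.uniform([VOCAB_SIZE, EMBED_DIM], -1.0, 1.0))
nce_weights = tf.Variable(tf.random.truncated_normal([VOCAB_SIZE, EMBED_DIM], stddev=0.1))
nce_biases = tf.Variable(tf.zeros([VOCAB_SIZE]))

@tf.function
def skipgram_nce_loss(center_ids, context_ids):
    # center_ids: (batch,) int ids of center words
    # context_ids: (batch, 1) int64 ids of the true context words
    center_vecs = tf.nn.embedding_lookup(embeddings, center_ids)
    return tf.reduce_mean(
        tf.nn.nce_loss(weights=nce_weights,
                       biases=nce_biases,
                       labels=context_ids,
                       inputs=center_vecs,
                       num_sampled=NUM_SAMPLED,   # negative samples drawn per batch
                       num_classes=VOCAB_SIZE))
```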


Detecting Deceptive Hotel Reviews

Re-implementation of the paper Finding Deceptive Opinion Spam by Any Stretch of the Imagination: I developed machine learning models to classify truthful and deceptive reviews for the top 20 hotels in Chicago. The hotel reviews were obtained from TripAdvisor.

  • Dataset: The dataset consisted of 400 truthful and 400 deceptive hotel reviews.
  • Algorithms used: Support Vector Machines, Naive Bayes
  • Language encodings used: uni-grams, bi-grams, and tri-grams; the bi-gram and tri-gram representations also included the lower-order n-grams (see the sketch after this list).
  • Technologies used: Python
  • Libraries used: Scikit-Learn, NLTK
  • You may view the code for this project on GitHub here
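
A minimal Scikit-Learn sketch of one configuration (bi-grams plus uni-grams with a linear SVM) is shown below; the load_reviews loader is hypothetical, and the exact classifier settings are assumptions.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Hypothetical loader returning review texts and 0/1 labels (truthful vs. deceptive).
reviews, labels = load_reviews()

bigram_svm = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 2))),  # bi-grams plus the uni-grams they subsume
    ("clf", LinearSVC()),
])
print(cross_val_score(bigram_svm, reviews, labels, cv=5).mean())
```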

Interactive Visual Dashboard for Road Accidents in the United States

Developed a concise, interactive, single-screen visual dashboard using Python and d3.js to study the effects of weather conditions on road accidents within the United States.

  • Dataset: The dataset contained statistics of road accidents within the United States. This dataset was obtained from Kaggle. The dataset can be found here
  • Technologies used:
    • JavaScript: I used d3.js to create the visualisations
    • Python: A Flask server to host the dashboard, perform operations on the data (Principal Component Analysis, k-means clustering), and send updated data to the dashboard.
    • HTML, CSS: To create the webpage containing the visualisations
  • Data Science Techniques used:
    • Principal Component Analysis: I performed dimensionality reduction using PCA to generate a scatter-plot matrix of the top two principal components (a sketch of both data-science steps follows this list).
    • Stratified Sampling using k-means clustering: To project the data onto a parallel coordinates chart, I performed stratified sampling using k-means clustering. This reduced the number of data points being projected while maintaining the data distribution.
  • Libraries Used: Scikit-Learn, d3.js, Flask
  • You may view the code for this project on GitHub here
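
The sketch below outlines the two server-side data-science steps; the CSV filename, column handling, number of clusters, and sampling fraction are all assumptions rather than the dashboard's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("us_accidents.csv")                        # hypothetical filename
numeric = df.select_dtypes(include=np.number).dropna(axis=1)
X = StandardScaler().fit_transform(numeric)

# Top-2 principal components, used for the scatter-plot matrix.
top2 = PCA(n_components=2).fit_transform(X)

# Stratified sampling via k-means: cluster the data, then sample from each cluster in
# proportion to its size so the reduced dataset keeps roughly the original distribution.
k = 5                                                       # assumed number of strata
clusters = KMeans(n_clusters=k, random_state=0).fit_predict(X)
sample_idx = np.concatenate([
    np.random.choice(np.where(clusters == c)[0],
                     size=max(1, int(0.1 * (clusters == c).sum())),  # ~10% per stratum
                     replace=False)
    for c in range(k)
])
sampled = df.iloc[sample_idx]   # rows sent to the parallel-coordinates chart
```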

Machine Learning Algorithms from scratch

Implemented five machine learning algorithms from scratch (i.e. without using Scikit-Learn). The algorithms I implemented are:

  • Perceptron:
    • I implemented the perceptron algorithm from scratch using Python (a sketch of the training loop follows this list). My implementation can be found on GitHub here
  • Adaptive Boosting (AdaBoost):
    • I implemented Adaptive Boosting from scratch using Python.
    • I used decision stumps as the weak learners for my implementation.
    • My implementation can be found on GitHub here
  • Support Vector Machines (SVM):
    • I implemented Support Vector Machines from scratch using Python.
    • I used stochastic gradient descent optimization for my implementation of SVM.
    • My implementation can be found on GitHub here
  • K-means Clustering:
    • I implemented the k-means clustering algorithm from scratch using Python.
    • I used Euclidean distance, Manhattan distance and Minkowski distance metrics for my implementation.
    • My implementation can be found on GitHub here
  • K-nearest Neighbors:
    • I implemented the k-nearest neighbors algorithm from scratch using Python.
    • My implementation can be found on GitHub here
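
As an example of the from-scratch style, below is a minimal NumPy sketch of the perceptron training loop; the learning rate and epoch count are assumptions, and labels are taken to be in {-1, +1}.

```python
import numpy as np

def train_perceptron(X, y, epochs=20, lr=1.0):
    """Classic perceptron updates: adjust the weights only on misclassified points."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi
                b += lr * yi
    return w, b

def predict(X, w, b):
    return np.sign(X @ w + b)
```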

COVID-19 Data Analysis using Hadoop and Spark

Used Hadoop and Spark to analyse a COVID-19 dataset.

  • I performed 3 tasks (see the sketch after this list):
    • Obtain the number of cases per country and world-wide
    • Obtain the number of cases per country and world-wide for a given time period
    • Obtain country-wise number of cases per million
  • Technologies used: Hadoop, Spark, Java
  • My implementation can be found on GitHub here
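
The project itself was written in Java, but the three aggregations are easy to outline; the PySpark sketch below is purely illustrative, and the column names ("country", "date", "cases", "population") are assumptions about the dataset schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("covid19-analysis").getOrCreate()
df = spark.read.csv("covid19.csv", header=True, inferSchema=True)   # hypothetical path

# Task 1: number of cases per country and world-wide.
per_country = df.groupBy("country").agg(F.sum("cases").alias("total_cases"))
world_total = df.agg(F.sum("cases").alias("world_total"))

# Task 2: the same aggregation restricted to a given time period.
in_range = df.filter((F.col("date") >= "2020-03-01") & (F.col("date") <= "2020-04-30"))
per_country_range = in_range.groupBy("country").agg(F.sum("cases").alias("total_cases"))

# Task 3: country-wise number of cases per million inhabitants.
per_million = per_country.join(
    df.select("country", "population").dropDuplicates(["country"]), "country"
).withColumn("cases_per_million",
             F.col("total_cases") / F.col("population") * 1_000_000)
```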