- Created a GAN in PyTorch. Notebook here
- Created a fun autoencoder in PyTorch. Notebook here
- Completed course 1 of Advanced Machine Learning Specialization from the Higher School of Economics in Moscow (National Research University).
- Completed Deep Learning Specialization from Coursera. This is a 5-course specialization on deep learning, which teaches both theory and application of deep learning. This specialization is taught by Prof. Andrew Ng (Adjunct Professor at Stanford University, who was also founding lead at Google Brain).
- See my certificate here. Some interesting projects that I completed were:
- Art generation with neural style transfer
- Face recognition
- YOLO for autonomous driving (car detection)
- Building ResNet (residual network) architecture from scratch
- Machine translation
- Trigger word detection
- Character level language model
- Jazz solo with an LSTM network
- Creating an “emojifier” to make text messages more expressive
The following section contains topics and areas that I am interested in.
Deep reinforcement learning
- The BabyAI platform (see paper here), imitation learning, and grounded language learning with DL. The BabyAI platform is a testbed for improving the sample efficiency of deep RL algorithms with the goal of making dramatic advances in grounded language learning.
- I’ve reproduced the sample efficiency results reported in Table 3 of the BabyAI paper for imitation learning from the bot.
- Sequence-to-sequence RNN models for machine translation (PyTorch)
- Transformers for neural machine translation
- Prototypical networks for few-shot classification
- Upcoming: graph neural networks
The following section contains my (mostly completed) projects, organized by topic/area of data science.
New York Times Comments Analysis (Fall 2019)
- Description: The objective of this project is to perform text classification to predict New York Times article topics from the comments written by readers.
- Dataset: Data is from this Kaggle competition. This open-source dataset contains NYT readers’ comments and articles from January to May 2017 and January to April 2018. The comments data contains “over 2 million comments and 34 features”, while the articles data contains more than 9,000 articles and 16 features.
- Packages/languages: R and Keras.
Object detection: understanding clouds from satellite images (Fall 2019)
- Description: The objective of this project is to predict the presence of specific cloud types in satellite images. However, this is not a standard object detection challenge because each image may contain multiple labels and multiple classes that are overlapping. That is, one training image may have an area that is labeled as 2 or more distinct classes. So this is the main challenge: how to deal with overlapping labels for objects that have indefinite shapes (like clouds).
- Dataset: Data is from this Kaggle competition. The dataset consists of ~9,300 unique photos: 5,600 in training and 3,700 in test.
- Packages/languages: Convolutional neural network written in Python + Keras + TensorFlow. Trained on a GPU on Kaggle.
- Work in progress
Nature scene image classification (Summer 2019)
- Description: The objective of this project is to classify nature scene images into 6 distinct categories: buildings, forest, glacier, mountain, sea and street. See notebooks with local or kaggle suffix in the link above.
- Dataset: Based on the Intel image competition dataset: 25k images labeled under 6 categories.
- Accuracy: 89% with 15 epochs (and possibly higher with more epochs)
- Packages/languages: Both pre-trained and original neural networks were used for this purpose. Convolutional neural networks written in Python + Keras + TensorFlow. Trained on Kaggle with a GPU.
Classification: predicting whether it will rain tomorrow (Summer 2019)
- Description: The objective of this project is to use non-neural ML methods to predict the occurrence of rain. See executive summary in repo above for high-level details.
- Dataset dimensions: 142k observations x 23 predictors
- The purpose of this project is to predict whether it will rain tomorrow in Australia.
- Methods used: random forest, SVM, LDA (linear discriminant analysis), and a neural network. Recursive feature elimination with random forest used for subset selection.
- Accuracy: 85%
- Language: R
Regression: predicting retail price of vehicles (Summer 2019)
- Description: The objective of this project is to use non-neural ML methods to predict the retail price of vehicles. See midterm.pdf at link above for details.
- Dataset dimensions: 428 observations x 14 predictors
- This project demonstrates the prediction of retail price using 7 regression modeling methods.
- Methods used: random forest, bagging (bootstrap aggregation), boosting, multiple linear regression, the shrinkage/regularization methods of Ridge, Lasso, and Elastic-Net regression, and K-nearest neighbors. Regression subset selection (regsubsets) was used for model selection.
- Accuracy: R^2 of 96%
- Language: R
- Description: The objective of this project is to improve bank margins by optimizing loan-making decisions. Said differently: the goal is to predict the financial risk that each customer poses to the bank. See project pdf at link above for details.
- Dataset dimensions: The dataset used to train this model includes 50,000 loans and 30 variables.
- Methods used: Logistic modeling
- Accuracy: 80%
- Language: R
Regression: predicting health examinations (Summer 2018)
- Description: The objective of this project is to create a predictive model to forecast medical examinations for a health organization. Analysis is based on a Harvard Business Review case titled “The Fargo Health Group (FHG) Case”. This project demonstrates: (1) data imputation and (2) creating a predictive model using multiple time-series autoregressive forecasting methods. Read report summary at link above for details.
- Methods used: ARIMA and Holt’s exponential smoothing.
- Language: R
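As a rough illustration of Holt's exponential smoothing named above, here is a minimal pure-Python sketch of the linear trend method. It is not the project's R code: the function name `holt_forecast`, the smoothing values, and the sample series are assumptions made up for this example, and the initialisation (first value as level, first difference as trend) is just one common simple choice.

```python
from typing import List, Sequence


def holt_forecast(series: Sequence[float], alpha: float, beta: float,
                  horizon: int) -> List[float]:
    """Forecast `horizon` steps ahead with Holt's linear trend method.

    alpha smooths the level, beta smooths the trend (both in (0, 1]).
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")

    # Simple initialisation: level = first value, trend = first difference.
    level = series[0]
    trend = series[1] - series[0]

    for y in series[1:]:
        prev_level = level
        # Level: blend the new observation with the previous one-step forecast.
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # Trend: blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend

    # The h-step-ahead forecast extends the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]


if __name__ == "__main__":
    # Hypothetical monthly counts, invented for illustration only.
    monthly_counts = [10, 12, 14, 16, 18]
    print(holt_forecast(monthly_counts, alpha=0.5, beta=0.5, horizon=3))
```

On a perfectly linear series like the one above, the level and trend track the data exactly, so the forecast simply continues the line.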
Simulation and Decision Analysis (Fall 2019)
- Description: The objective of this project is to perform simulation and decision analysis to determine the best decision (among a set of alternatives) for a given business scenario.
- Packages/languages: Python and OR-Tools (Google library)
Hyperparameter tuning with scikit-learn (Fall 2019)
- Description: The objective of this project is to determine the optimal hyperparameters for 2 models: a RandomForest regressor and an XGBoost classifier. Hyperparameter tuning is performed with 4 methods: GridSearchCV, RandomizedSearchCV, BayesianOptimization, and TPOT.
- Dataset: Sklearn diabetes dataset.
- Packages/languages: Python and Scikit-learn (sklearn)
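To give a flavor of the first two tuning methods listed above, here is a small sketch that runs an exhaustive grid search and a randomized search over a random forest regressor on the scikit-learn diabetes dataset. The search space, fold count, and scoring metric below are assumptions chosen to keep the example fast; they are not the settings used in the project.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)

# Illustrative search space (an assumption, not the project's actual grid).
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, 5, None],
}

# Exhaustive search: evaluates all 2 x 3 = 6 combinations with 3-fold CV.
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="r2",
)
grid.fit(X, y)

# Randomized search: samples only n_iter combinations from the same space.
rand = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    n_iter=3,
    cv=3,
    scoring="r2",
    random_state=0,
)
rand.fit(X, y)

print("grid search best:  ", grid.best_params_, round(grid.best_score_, 3))
print("random search best:", rand.best_params_, round(rand.best_score_, 3))
```

The grid search always covers the whole space, while the randomized search trades coverage for a fixed budget of `n_iter` candidates, which matters more as the search space grows.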
Hadoop / Big Data
The list below consists of projects that solve big data-related business questions using the Hadoop framework. Languages/software used: Pig, Hive, Spark, Scala, Zeppelin, Python, Java, AWS (EMR, S3, EC2, Athena, Glue). (Spring 2019)
- Final Project -
- Part 1: Big data analysis with Hive and Scala on Spark
- Part 2: Parallel programming in Java: “Mailman’s dilemma” algorithm. In part 2, I created an algorithm to find the most efficient way for a mailman to deliver mail, given N buildings.
- Part 3: Big data on AWS: Flight analysis project. In part 3, I selected a large dataset of flight arrival times, hosted it on AWS, and answered questions about it using Scala.
- Other projects -
Natural Language Processing
Twitter Sentiment Analysis (Fall 2018)
- Description: The objective of this project is to answer the question: are people who talk about fitness happier than people who talk about media (TV, movies, YouTube, etc.)?
- Methods used: Twitter data is collected using the REST and Streaming APIs, then cleaned, organized, and analyzed for sentiment. Sentiment analysis is performed using the Natural Language Toolkit (NLTK) VADER sentiment analyzer. All of this is done in Python. Statistical analysis is performed in R.
- Other: Twitter Sentiment Analyzer is the standalone program created to perform sentiment analysis on Twitter data.
Visualizing World Bank Data (Fall 2018)
- Description: The objective of this project is to demonstrate the iterative development of visualizations using principles of good design by Edward Tufte.
- Languages: Data cleaned, processed, and plotted in R.
Network Analysis of US Airports (Fall 2019)
- Description: This project demonstrates network analysis in R with data of air travel in the United States.
- Packages/languages: R, igraph, statnet