The following section contains topics and areas that I am actively researching.
Deep reinforcement learning
- Researching the BabyAI platform (see paper here), imitation learning, and grounded language learning with DL. This platform is a testbed for improving the sample efficiency of deep RL algorithms with the goal of making dramatic advances in grounded language learning.
- I’ve reproduced the sample efficiency results reported in Table 3 of the BabyAI paper for imitation learning from the bot.
- Sequence-to-sequence RNN models for machine translation (PyTorch)
- Transformers for neural machine translation
- Prototypical networks for few-shot classification
- Upcoming: graph neural networks
The following section contains my (mostly completed) projects, organized by topic/area of data science.
New York Times Comments Analysis (Fall 2019)
- Description: The objective of this project is to perform text classification to predict New York Times article topics from the comments written by readers.
- Dataset: Data is from this Kaggle competition. This open-source dataset contains NYT readers’ comments and articles from January to May 2017 and January to April 2018. The comments data contains “over 2 million comments and 34 features”, while the articles data contains more than 9,000 articles and 16 features.
- Packages/languages: R and Keras.
Object detection: understanding clouds from satellite images (Fall 2019)
- Description: The objective of this project is to predict the presence of specific cloud types in satellite images. This is not a standard object detection challenge, because each image may carry multiple labels whose classes overlap: one area of a training image may be labeled as 2 or more distinct classes. The main challenge is therefore how to handle overlapping labels for objects with indefinite shapes, such as clouds.
- Dataset: Data is from this Kaggle competition. The dataset consists of ~9,300 unique photos: 5,600 in training and 3,700 in test.
- Packages/languages: Convolutional neural network written in Python + Keras + TensorFlow. Trained on GPU on Kaggle.
- Work in progress
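One common way to represent overlapping labels is to keep a separate binary mask per class, so that a single pixel can belong to several classes at once. The sketch below illustrates that idea only; it is not necessarily the approach used in this project, and the image size, class indices, and rectangular regions are hypothetical.

```python
from typing import List, Tuple

# (class_index, top, left, bottom, right); bottom and right are exclusive.
# Rectangles are a hypothetical simplification of real cloud regions.
Region = Tuple[int, int, int, int, int]


def build_masks(height: int, width: int, num_classes: int,
                regions: List[Region]) -> List[List[List[int]]]:
    """Return one binary mask per class, indexed as masks[class][row][col]."""
    masks = [[[0] * width for _ in range(height)] for _ in range(num_classes)]
    for cls, top, left, bottom, right in regions:
        # Clip each region to the image bounds before filling it in.
        for row in range(max(top, 0), min(bottom, height)):
            for col in range(max(left, 0), min(right, width)):
                masks[cls][row][col] = 1
    return masks


def classes_at(masks: List[List[List[int]]], row: int, col: int) -> List[int]:
    """List every class whose mask covers the given pixel."""
    return [cls for cls, mask in enumerate(masks) if mask[row][col] == 1]


# Hypothetical 4x6 image with two labeled regions that overlap.
example_masks = build_masks(4, 6, num_classes=2,
                            regions=[(0, 0, 0, 3, 4), (1, 1, 2, 4, 6)])
print(classes_at(example_masks, 2, 3))  # pixel covered by both classes
```

With targets in this form, a network would typically use an independent sigmoid output per class rather than a softmax across classes, since the classes are not mutually exclusive.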
Nature scene image classification (Summer 2019)
- Description: The objective of this project is to classify nature scene images into 6 distinct categories: buildings, forest, glacier, mountain, sea and street. See notebooks with local or kaggle suffix in the link above.
- Dataset: 25k images labeled under 6 categories, from an Intel image competition.
- Accuracy: 89% with 15 epochs (and possibly higher with more epochs)
- Packages/languages: Both pre-trained networks and original networks built for this project. Convolutional neural networks written in Python + Keras + TensorFlow. Trained on Kaggle with GPU.
Classification: predicting whether it will rain tomorrow (Summer 2019)
- Description: The objective of this project is to use non-neural ML methods to predict the occurrence of rain. See executive summary in repo above for high-level details.
- Dataset dimensions: 142k observations x 23 predictors
- The purpose of this project is to predict whether it will rain tomorrow in Australia.
- Methods used: random forest, SVM, LDA (linear discriminant analysis), and a neural network. Recursive feature elimination with random forest used for subset selection.
- Accuracy: 85%
- Language: R
Regression: predicting retail price of vehicles (Summer 2019)
- Description: The objective of this project is to use non-neural ML methods to predict the retail price of vehicles. See midterm.pdf at link above for details.
- Dataset dimensions: 428 observations x 14 predictors
- This project demonstrates the prediction of retail price using 7 regression modeling methods.
- Methods used: random forest, bagging (bootstrap aggregation), boosting, multiple linear regression, shrinkage/regularization methods of Ridge, Lasso, and Elastic-Net regression, and K-nearest neighbors. Regression subset selection (regsubsets) used for model selection.
- Accuracy: R^2 of 96%
- Language: R
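For reference, the R^2 metric reported above is the share of variance in the observed values that the model's predictions explain. The project itself was done in R; the following is a minimal Python sketch of the calculation, and the prices in the usage line are hypothetical.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


# Hypothetical retail prices and model predictions.
print(r_squared([20000.0, 30000.0, 40000.0], [21000.0, 29000.0, 40500.0]))
```

A value of 1 means the predictions match the observations exactly, while 0 means the model does no better than predicting the mean price for every vehicle.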
- Description: The objective of this project is to improve bank margins by optimizing loan-making decisions. Said differently: the goal is to predict the financial risk that each customer poses to the bank. See project pdf at link above for details.
- Dataset dimensions: The dataset used to train this model includes 50,000 loans and 30 variables.
- Methods used: Logistic modeling
- Accuracy: 80%
- Language: R
Regression: predicting health examinations (Summer 2018)
- Description: The objective of this project is to create a predictive model to forecast medical examinations for a health organization. Analysis is based on a Harvard Business Review case titled “The Fargo Health Group (FHG) Case”. This project demonstrates: (1) data imputation and (2) creating a predictive model using multiple time-series autoregressive forecasting methods. Read report summary at link above for details.
- Methods used: ARIMA and Holt’s exponential smoothing.
- Language: R
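Holt’s exponential smoothing, one of the two methods listed above, tracks a smoothed level and a smoothed trend and extrapolates both. The project used R; the sketch below is an illustrative Python version, not the project’s code. The initialization (first value as the level, first difference as the trend) is one common choice and is an assumption here, as are the example series and smoothing parameters.

```python
from typing import List, Sequence


def holt_forecast(series: Sequence[float], alpha: float, beta: float,
                  horizon: int) -> List[float]:
    """Forecast `horizon` steps ahead with Holt's linear trend method."""
    if len(series) < 2:
        raise ValueError("need at least two observations to initialize the trend")
    if not (0 < alpha <= 1 and 0 < beta <= 1):
        raise ValueError("alpha and beta must lie in (0, 1]")

    # Assumed initialization: level = first value, trend = first difference.
    level = series[0]
    trend = series[1] - series[0]

    for value in series[1:]:
        previous_level = level
        # Blend the new observation with the previous one-step-ahead forecast.
        level = alpha * value + (1 - alpha) * (previous_level + trend)
        # Blend the newest change in level with the previous trend estimate.
        trend = beta * (level - previous_level) + (1 - beta) * trend

    # The h-step-ahead forecast extends the last level along the last trend.
    return [level + step * trend for step in range(1, horizon + 1)]


# Hypothetical monthly examination counts with a steady upward trend.
print(holt_forecast([10, 12, 14, 16, 18], alpha=0.5, beta=0.5, horizon=3))
```

On a perfectly linear series like the one above, the method simply continues the line; on noisy data, smaller values of alpha and beta make the level and trend respond more slowly to recent observations.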
Simulation and Decision Analysis (Fall 2019)
- Description: The objective of this project is to perform simulation and decision analysis to determine the best decision (among a set of alternatives) for a given business scenario.
- Packages/languages: Python and OR-Tools (Google library)
Hyperparameter tuning with scikit-learn (Fall 2019)
- Description: The objective of this project is to determine the optimal hyperparameters for 2 models: a RandomForest regressor and an XGBoost classifier. Hyperparameter tuning is performed with 4 methods: GridSearchCV, RandomizedSearchCV, BayesianOptimization, and TPOT.
- Dataset: Sklearn diabetes dataset.
- Packages/languages: Python and scikit-learn (sklearn)
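The difference between the first two tuning methods can be shown without any library: grid search scores every combination in a parameter grid, while random search scores only a fixed number of sampled combinations. The sketch below is a library-free illustration, not the project’s code. `toy_objective` is a hypothetical stand-in for a cross-validated model score, the grid values are made up, and sampling with replacement is a simplification.

```python
import itertools
import random
from typing import Callable, Dict, List, Optional, Tuple

Params = Dict[str, float]


def grid_search(objective: Callable[[Params], float],
                grid: Dict[str, List[float]]) -> Tuple[Optional[Params], float]:
    """Score every combination in the grid and keep the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(objective: Callable[[Params], float],
                  grid: Dict[str, List[float]], n_iter: int,
                  seed: int = 0) -> Tuple[Optional[Params], float]:
    """Score n_iter randomly sampled combinations (with replacement)."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params: Params) -> float:
    # Hypothetical score surface that peaks at max_depth=6, n_estimators=200.
    return -((params["max_depth"] - 6) ** 2) - ((params["n_estimators"] - 200) / 100) ** 2


toy_grid = {"max_depth": [2, 4, 6, 8], "n_estimators": [100, 200, 300]}
print(grid_search(toy_objective, toy_grid))
print(random_search(toy_objective, toy_grid, n_iter=5))
```

Grid search always finds the best combination in the grid at the cost of evaluating all of them (12 here), whereas random search caps the number of evaluations and may settle for a slightly worse combination.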
Hadoop / Big Data
The list below consists of projects that solve big data-related business questions using the Hadoop framework (Spring 2019). Languages/software used: Pig, Hive, Spark, Scala, Zeppelin, Python, Java, AWS (EMR, S3, EC2, Athena, Glue).
- Final Project -
- Part 1: Big data analysis with Hive and Scala on Spark
- Part 2: Parallel programming in Java: “Mailman’s dilemma” algorithm. In part 2, I created an algorithm to find the most efficient way for a mailman to deliver mail, given N buildings.
- Part 3: Big data on AWS: Flight analysis project. In part 3, I selected a big dataset of flight arrival times, hosted it on AWS, and answered questions about it using Scala.
- Other projects -
Natural Language Processing
Twitter Sentiment Analysis (Fall 2018)
- Description: The objective of this project is to answer the question: are people who talk about fitness happier than people who talk about media (TV, movies, YouTube, etc.)?
- Methods used: Twitter data is collected using the REST and Streaming APIs, then cleaned, organized, and analyzed for sentiment. Sentiment analysis is performed using the VADER sentiment analyzer from the Natural Language Toolkit (NLTK). All of this is done in Python; the statistical analysis is performed in R.
- Other: Twitter Sentiment Analyzer is the standalone program created to perform sentiment analysis on Twitter data.
Visualizing World Bank Data (Fall 2018)
- Description: The objective of this project is to demonstrate the iterative development of visualizations using principles of good design by Edward Tufte.
- Languages: Data cleaned, processed, and plotted in R.
Network Analysis of US Airports (Fall 2019)
- Description: This project demonstrates network analysis in R using data on air travel in the United States.
- Packages/languages: R, igraph, statnet