ReallyIsTrump Tweet Predictor

Determining if tweets from @realDonaldTrump are written by the President or his staff

Donald Trump’s proclivity for using Twitter has changed how the White House interacts with the media and broadcasts to the president’s followers. It has also led to the birth of a new academic subject and a hobby for data analysts. People have studied how engagement has changed during his first 100 days and conducted sentiment analysis of his tweets both during the campaign and as president. [Read More]

Forecasting Divvy bikesharing traffic

Comparing Exponential Smoothing, ARIMA, and Prophet to predict bikesharing traffic

Introduction

Divvy is a bike sharing system for the city of Chicago that provides residents and tourists an option for getting around the city. After patrons purchase a daily or annual pass, they can unlock a bike, ride to their destination, and return the bike to one of the Divvy docking stations found throughout the city. The daily and annual passes include 30 minutes of riding time, with additional fees for longer trips. [Read More]

A Shiny App for the Biology Alumni Survey

I recently built an interactive dashboard for the Elmhurst Biology Department’s alumni survey.

Background

Our biology department at Elmhurst is highly motivated to improve our teaching methods and styles. Currently, we are updating our introductory course curriculum by adding more active learning, standardizing laboratory modules, and developing central themes to introduce students to the field of biology and the scientific method. As part of this curriculum revision, we recently reached out to alumni through social media and asked them to evaluate their training here at Elmhurst in light of their current career path. [Read More]

West Nile Virus in Chicago, Part 1

This project will explore and analyze a data set for West Nile Virus (WNV) detection in mosquitoes that were captured using traps around Chicago, IL. The data originate from the Kaggle competition, West Nile Virus Prediction. Even though the competition for this data set is over, I like using data sets from Kaggle to practice data viz and machine learning for the following reasons. The data is structured so there is instant gratification. [Read More]

Enova Data Smackdown Competition

Last night I attended a great data science meetup hosted by Enova Decisions in downtown Chicago. I’ve been to a few other data-related meetups that mostly focused on talks or networking, but this one was built entirely around a challenging data set for participants to delve into. I heard about the Smackdown from the fantastic Chicago R Users Group (RUG) earlier this week and made sure to register as soon as I could. [Read More]

Monte Carlo of Random Correlations

Exploring correlations of random numbers

When working with big data, you need to be more aware of statistical outliers than you do with more typical data sizes. Basic statistical tests like a Student’s t-test or Pearson correlation are acceptable when you only test a few relationships in a small data set. But when you examine correlations across thousands of columns of data, you are bound to find several that are strongly correlated purely by chance. [Read More]
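The point in the excerpt above can be illustrated with a small simulation. This is only a minimal sketch in Python, not the code from the original post: the function names, the sample sizes (20 rows, 1,000 columns), the seed, and the cutoff of 0.444 (roughly the two-sided p < 0.05 threshold for a Pearson correlation with 20 observations) are all assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def count_chance_correlations(n_rows=20, n_cols=1000, threshold=0.444, seed=1):
    """Correlate one random target column against many random columns.

    All values are independent draws from a standard normal distribution,
    so any correlation that passes the threshold is a false positive.
    Returns (all correlations, correlations with |r| >= threshold).
    """
    rng = random.Random(seed)  # hypothetical seed, fixed for repeatability
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    correlations = []
    for _ in range(n_cols):
        column = [rng.gauss(0, 1) for _ in range(n_rows)]
        correlations.append(pearson(target, column))
    strong = [r for r in correlations if abs(r) >= threshold]
    return correlations, strong


if __name__ == "__main__":
    correlations, strong = count_chance_correlations()
    print(f"{len(strong)} of {len(correlations)} random columns pass the cutoff")
    print(f"largest |r| found: {max(abs(r) for r in correlations):.3f}")
```

Because every column is pure noise, each "significant" correlation reported here is spurious. Under these assumptions, roughly 5% of the columns would be expected to pass the cutoff, which is why testing thousands of columns calls for a multiple-comparison correction.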

Hadoop Popularity

Hive and Pig are no match for Spark

Exploring the popularity of Pig and Hive

Pig and Hive are sometimes compared with one another for their ability to do data manipulations on a Hadoop cluster. There are some important differences. Hive’s query language, HiveQL, is closely modeled on SQL, which gives it a leg up in terms of user familiarity. I wanted to see how the two compared in the number of posts on Stack Overflow, a popular question-and-answer site for software developers. [Read More]

Working with Pigs

Grunt

This week I’ve been learning about the Pig language for Hadoop distributed computing systems. This is the first of several languages that we are covering this semester that were designed as abstractions on top of MapReduce. It’s really interesting to me how many different layers of programming can sit on top of one another and work together to make what is essentially machine language more human-like to work with. In the case of Pig, the language is called Pig Latin, and it resembles SQL in several ways. [Read More]

How to set up RStudio on AWS for a Bioinformatics class

Motivations for the RStudio course server

This fall is probably the most enjoyable semester I’ve ever had teaching. I had an opportunity to design and run an upper-level special topics course on bioinformatics at Elmhurst College. It’s a class I’ve always wanted to teach because for me, it’s like being able to organize a class around everything I would have wanted to know before going to graduate school for genetics. [Read More]

Grupo Bimbo Report

Predicting Demand of Bakery Goods

Grupo Bimbo Technical Report

Project Overview

Our client, Grupo Bimbo, wants to develop a model to accurately forecast inventory demand based on the historical sales data they collect. Grupo Bimbo is a large bakery store chain that has more than 2,500 products spanning over one million stores. Their goal is to meet the product demand for their customers while minimizing unsold surplus. Our group is tasked with creating a model to accurately forecast inventory demand based on the historical sales data provided. [Read More]