Connecting R to a Redshift Data Warehouse for Machine Learning

By Skyler Shasky | January 23, 2019 | Updated On: April 27, 2023 | Redshift

In an increasingly data-focused world, the term “machine learning” is more popular than ever. However, many find the concept intimidating and believe that it is too expensive, confusing, or time-consuming to be utilized within their organization. Luckily, there are tools available to make machine learning accessible and (relatively) easy for anybody willing to give it a try.

R is one of the most popular statistical computing languages available in the market today. It has the flexibility to run a variety of machine learning algorithms with applications across every industry. While R shines when it comes to statistical modeling, it benefits from the support of a well-defined data ingestion and warehousing process in order to curate the data sets required for analysis.

A data warehouse is a great option to provide this support. Data warehouses allow users to gather data from multiple disparate sources and house it in a central repository. This central source of truth can then be used for transformation, visualization, dashboarding, and advanced analytics.

Panoply is a managed data warehouse with built-in ETL that makes it easy to connect your data to R, Python, or a variety of other analytical tools.

For our demonstration, I’ve ingested a Kaggle dataset with 67 years of individual statistics for NBA players. Please note that I’m writing my R scripts in RStudio which is an IDE for R.

Collecting and visualizing data

To connect my R environment to Redshift, I employ the RPostgreSQL package. Here is the syntax used to ingest my NBA table into R as a dataframe.

 install.packages("RPostgreSQL")
 library(RPostgreSQL)
 driver <- dbDriver("PostgreSQL")
 SampleConnection <- dbConnect(driver,
 host = 'db.panoply.io',
 port = '5439',
 user = '*****',
 password = '*****',
 dbname = '*****')
 SampleDF = dbReadTable(SampleConnection, 'nba table')

I’m going to start by visualizing the data as a histogram. Shooting efficiency is a big topic in basketball these days, so it might be interesting to create a plot that shows me the distribution of true shooting percentage by season for all players. First I will add a couple of conditions so that I only visualize seasons past 1990, and only include players who had more than 1550 field goal attempts in a given season. After plotting the histogram, I see that the data is roughly in the shape of a bell curve, centered around approximately 57% true shooting percentage.

ModernEra = SampleDF[SampleDF$year >= 1990,] ModernEra = ModernEra[ModernEra$fga >= 1550,] hist(ModernEra$ts., xlab='True Shooting Percentage', main='Histogram of True Shooting Percentage')

Now I want to visualize this data as a scatterplot to identify the outliers. To do this I will plot the true shooting percentage against the number of field goal attempts. Once I label the data, it becomes clear that the outliers include Steph Curry, Kevin Durant, and Michael Jordan - for most fans that should not be a surprise!

 plot(ModernEra$ts.,ModernEra$fga, xlab='True Shooting Percentage', ylab='Field Goal Attempts', main='Scatterplot of FGA vs TS%')
 text(ModernEra$ts.,ModernEra$fga,labels=ModernEra$player,cex = .5,pos = 3)

Machine learning application

While visualizing the data was a fun exercise, I want to take R a step further by actually forming statistical models via machine learning. Specifically I will create models that will determine an NBA player’s position based on their performance in certain statistical categories. First, it is necessary to do some minor data cleaning, define the dataset, and split it into a testing and training set.

 ###Defining and cleaning the dataset
 Variables = c("X2p.", "ft.", "X3p.", "trb.", "age", "pos")
 Positions = c("PG","SG","SF","PF","C")
 TreeDF = SampleDF[Variables]
 TreeDF = TreeDF[TreeDF$pos %in% Positions,]
 TreeDF = TreeDF[complete.cases(TreeDF),]
 TreeDF$pos = as.factor(TreeDF$pos)
 TreeDF$pos = droplevels(TreeDF$pos)
 summary(TreeDF)

###Splitting into testing and training sets set.seed(13) Training = sample(nrow(TreeDF),.8*nrow(TreeDF), replace = FALSE) TrainingSet = TreeDF[Training,] TestingSet = TreeDF[-Training,]

Now it is time to apply an analytical modeling algorithm, for this classification use case I will try CART (classification and regression tree).

install.packages(rpart) library(rpart) NBA_CART = rpart(formula = pos ~ ., data = TrainingSet)

Next, I can visualize the tree that was created. After reviewing the tree, I can see that total rebound percentage and three point shooting percentage are the two biggest factors used in determining position based on the variables that were placed into the model.

plot(NBA_CART, margin = .1) text(NBA_CART)

With the model ready, I will utilize our testing data set to assess model performance.

TrainingPrediction_CART = predict(NBA_CART, TrainingSet, type = "class") table(TrainingPrediction_CART,TrainingSet$pos) mean(TrainingPrediction_CART == TrainingSet$pos) TestingPrediction_CART = predict(NBA_CART, TestingSet, type = "class") table(TestingPrediction_CART,TestingSet$pos) mean(TestingPrediction_CART == TestingSet$pos)

I find that the CART model is 51% accurate in assigning position when compared to the correct position in the data. Considering there are 5 different positions to choose from, this is pretty good!

CART was interesting, but I want to try one more statistical technique: Random Forest. Random forest is a powerful ensemble learning technique that can often create highly accurate classification and prediction models, though they can be difficult to interpret. I will create the model and test the accuracy.

NBA_Random_Forest = randomForest(formula = pos ~ ., data = TrainingSet, ntree=100) NBA_Random_Forest plot(NBA_Random_Forest) TrainingPrediction_RF <- predict(NBA_Random_Forest, TrainingSet, type = "class") table(TrainingPrediction_RF,TrainingSet$pos) mean(TrainingPrediction_RF == TrainingSet$pos) TestingPrediction_RF <- predict(NBA_Random_Forest, TestingSet, type = "class") table(TestingPrediction_RF,TestingSet$pos) mean(TestingPrediction_RF == TestingSet$pos)

The output shows that CART just beat out Random Forest in terms of classification accuracy. In this case I prefer the CART model because of the easily interpretable tree form of the output and the improved accuracy. Regardless of which model you might choose, the fact that anybody is able to create machine learning models with just a few lines of code is one of the coolest features of R.

To summarize the demonstration today, I collected data into R via an Amazon Redshift cluster in Panoply, visualized it, created predictive models, and tested their accuracy.

Hopefully as I went through this exercise you considered the problems unique to your organization that could be better understood through these techniques and technologies. This combination of R and Panoply might be the key to the next game-changing insight within your organization. Good luck and happy coding!