Microsoft R

Resources for Microsoft R:  A Revolution in Advanced Analytics


We hope you enjoy this collection of resources for Microsoft R. Keep an eye on the BlueGranite Blog and Resources pages for additional Microsoft R and advanced analytics posts and videos.

 

Webinar

R statistical computing has grown from a research and prototyping tool into a key component of commercial analytics platforms. The greatest evidence of this may be in Microsoft’s acquisition and subsequent embedding of Revolution R in its key data and analytics technologies.

Whether used locally or in the cloud, Microsoft R is a powerful tool for data discovery and predictive analytics. And with Microsoft's ‘Write Once, Deploy Anywhere’ approach, R code can be written once and deployed in a wide range of data management platforms, enterprise data warehouses, servers, and workstations without re-developing in another production platform.

Join us in this session as the BlueGranite advanced analytics team reviews the benefits of R, especially in the Microsoft analytics ecosystem.

In this webinar you'll learn about:

  • An introduction to the R platform – including a getting started guide

  • An overview of the major components of the Microsoft R ecosystem – including parallel processing on big data, and using R in environments like SQL Server 2016, Power BI, and Hadoop

  • A demonstration of R in action in multiple platforms – from prototyping to production

  • R resources and tips on executing a successful R project

  • The business value of Microsoft’s ‘Write Once, Deploy Anywhere’ approach with R – including key business issues around operationalizing analytics and avoiding re-engineering costs

View the webinar recording.

Tutorials

 

Tutorial: SQL Server R Services

 

This tutorial walks through some basic development and deployment scenarios for R Services in SQL Server 2016. You will use sample order data and take R code that can be run standalone in RStudio or another R IDE, adapt it for use in SQL Server, and then consume it both in SQL Server Management Studio and in Reporting Services.

Objectives

  • Learn about some of the Microsoft “RevoScaleR” functions
  • Adapt R code for a logistic regression model for use in SQL Server
  • Embed R code in SQL Server stored procedures
  • Make predictions on new data using a trained R model stored in SQL Server
  • View a plot in Reporting Services using R output from SQL Server

Prerequisites

  • Download and install the free Microsoft R Client on your local workstation. R Client includes the ScaleR (rx[…]) functions in the RevoScaleR package.
  • Download and install an R IDE such as RStudio or R Tools for Visual Studio.
  • Download and install SQL Server Management Studio.
  • Download and install Visual Studio and SQL Server Data Tools (only needed for Activity 3 - Reporting Services).
  • Verify access to an instance of SQL Server 2016 configured for R Services.
  • Verify access to an instance of SQL Server 2016 that has the WideWorldImportersDW sample database.

Activity 1 - Develop in R

  • In RStudio, open the RevoScaleR-vs-Traditional.R file.

  • Install the RODBC package (i.e., install.packages("RODBC")) if needed. RevoScaleR should already be available if you have Microsoft R Client installed.

  • Run the setup code in sequence to reference the RODBC and RevoScaleR libraries and reset your compute context to “local” if you have already been using a SQL compute context outside of this tutorial.

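    The setup portion of the script looks roughly like this (a sketch; see the tutorial file for the exact code):

    library(RODBC)
    library(RevoScaleR)

    # reset to the default local compute context
    rxSetComputeContext("local")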

  • Replace [SERVERNAME] in sql.conn.string with your SQL Server instance. You may need to change the trusted connection to Uid= and Pwd= depending on your authentication type. Once your connection string is ready, run the code to add sql.conn.string and source.query to your R environment. The source query is a simple query against the WWI Fact.Order table; for simplicity, it does not join to any dimensions to get descriptions.

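    A minimal sketch of these two objects; the engineered feature and the aliased column names below are assumptions based on the WideWorldImportersDW schema, so the tutorial file's actual query may differ:

    sql.conn.string <- "Driver=SQL Server;Server=[SERVERNAME];Database=WideWorldImportersDW;trusted_connection=true"

    # SameDayFulfillment compares the picked date to the order date;
    # columns are aliased to R-friendly names
    source.query <- "SELECT CASE WHEN [Picked Date Key] = [Order Date Key]
                                 THEN 1 ELSE 0 END AS SameDayFulfillment,
                            [City Key] AS CityKey,
                            [Customer Key] AS CustomerKey,
                            [Stock Item Key] AS StockItemKey,
                            Quantity
                     FROM Fact.[Order]"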

  • You will now look at one traditional method to analyze data from SQL Server. You will create an ODBC connection and import data from SQL Server into a local data frame named orders. Run the following code to load data and get the first few observations.

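    With RODBC, the import is along these lines:

    # open a connection, pull the query results into a data frame, and clean up
    conn <- odbcDriverConnect(sql.conn.string)
    orders <- sqlQuery(conn, source.query)
    odbcClose(conn)
    head(orders)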

  • Look at the first few observations. SameDayFulfillment is an engineered feature with a 0/1 value noting whether an order was packed the same day that it was placed: 1 means same day fulfillment, 0 means it was not. The other variables are unmodified values from the Fact.Order table. You should also notice the orders data frame in your environment window with 29,913 observations and 5 variables.


  • You will now run a very basic logistic regression using the built-in glm function on the local data frame and view the coefficients. Fitting an accurate model is not the goal of this tutorial, but in the real world, you would go through the appropriate training and testing steps.

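    A sketch of the local model, assuming the column names from the query above:

    # fit a logistic regression on the local data frame and view the coefficients
    local.model <- glm(SameDayFulfillment ~ ., data = orders, family = binomial)
    coef(local.model)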


  • Now that you’ve seen a quick overview of the traditional method, you will switch your compute context and process data in-database instead of locally. You will also change from traditional functions to the ScaleR “rx”-prefixed functions from the RevoScaleR package. Run the following code to change your compute context from local to SQL Server.

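    Switching contexts looks roughly like this:

    # push computation into the database, reusing the same connection string
    sql.cc <- RxInSqlServer(connectionString = sql.conn.string)
    rxSetComputeContext(sql.cc)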

  • Run the following code to build your SQL query.

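    A sketch of the in-database data object, built from the same query and connection string:

    sql.data <- RxSqlServerData(sqlQuery = source.query,
                                connectionString = sql.conn.string)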

  • The following code using rxImport() is shown for reference more than necessity. If you want to bring your data into a local data frame, you can use rxImport() with the in-database object.

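    For reference, that optional import would look like:

    # materialize the in-database data as a local data frame
    orders.local <- rxImport(sql.data)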

  • Otherwise, run the following code to train a logistic regression model using the SQL data object. Note that you now use the rxLogit() function (equivalent to glm above). You could take it a step further on your own and use the rxPredict() function with new data to classify whether or not additional orders would be fulfilled same day.


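    A sketch of the in-database model; the predictor names assume the query sketched earlier:

    rx.model <- rxLogit(SameDayFulfillment ~ CityKey + CustomerKey + StockItemKey + Quantity,
                        data = sql.data)
    summary(rx.model)

    # optional next step: score new observations with the trained model, e.g.
    # rxPredict(rx.model, data = new.orders)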

  • Finally, run the following code to produce a histogram displaying the quantity of items per order. Note the positive/right skew: most orders have fifty items or fewer, and the remaining bins form a long tail to the right.

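    The histogram is a one-liner along these lines (Quantity assumes the query above):

    # distribution of order quantities, computed against the SQL data object
    rxHistogram(~ Quantity, data = sql.data)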

Don’t worry about analyzing this data. The logistic regression and histogram are simple examples that will be used to show how to embed and work with your R code in SQL Server. The plot will surface again in the later Reporting Services activity.

 

Activity 2 - Adapt R Code for SQL Server

  • In Management Studio, open the R-Services-Example.sql file.

  • Connect to your SQL instance that has the WWI DW sample database and R Services installed.

  • Execute the following code to verify that you are using the correct database.

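    Something as simple as the following works:

    USE WideWorldImportersDW;
    SELECT DB_NAME() AS CurrentDatabase;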

  • You will now explore a very basic example of the new sp_execute_external_script stored procedure, which allows you to run R code using R Services. The key parameters are @language, @script, and @input_data_1; there are additional optional parameters that are not used here. @script is where the adapted R code goes. Note that except for changing the source data to InputDataSet, the code itself is similar to the logistic regression you ran earlier in RStudio. The final line of @script is there simply to print to stdout. Likewise, note that @input_data_1 is similar to the source SQL query that you used in RStudio. Execute the code and note the output on the Messages tab in Management Studio. Does it appear similar to the rxLogit() output in RStudio?

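    A minimal sketch of the pattern, reusing the hypothetical query and model from Activity 1 (the tutorial file's exact script may differ):

    EXECUTE sp_execute_external_script
        @language = N'R',
        @script = N'
            model <- rxLogit(SameDayFulfillment ~ CityKey + CustomerKey + StockItemKey + Quantity,
                             data = InputDataSet)
            # the final line simply prints the model summary to stdout
            print(summary(model))',
        @input_data_1 = N'SELECT CASE WHEN [Picked Date Key] = [Order Date Key]
                                      THEN 1 ELSE 0 END AS SameDayFulfillment,
                                 [City Key] AS CityKey, [Customer Key] AS CustomerKey,
                                 [Stock Item Key] AS StockItemKey, Quantity
                          FROM Fact.[Order]';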

  • The prior execution printed to stdout, but now you will wrap that code in another stored procedure for repeatable use. In the following code, note the changes. You are now creating a stored procedure called dbo.sp_EmbeddedRSample. Instead of printing the output, you are serializing the trained model, converting it to a data frame, and storing it in OutputDataSet. Finally, note the addition of WITH RESULT SETS at the end and how a binary model is the new output. Execute the code to create the new stored procedure.

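    The wrapped version follows this general shape (a sketch based on the narrative above, not the tutorial file's exact code):

    CREATE PROCEDURE dbo.sp_EmbeddedRSample
    AS
    BEGIN
        EXECUTE sp_execute_external_script
            @language = N'R',
            @script = N'
                model <- rxLogit(SameDayFulfillment ~ CityKey + CustomerKey + StockItemKey + Quantity,
                                 data = InputDataSet)
                # serialize the trained model and return it as a one-column data frame
                OutputDataSet <- data.frame(payload = as.raw(serialize(model, connection = NULL)))',
            @input_data_1 = N'SELECT CASE WHEN [Picked Date Key] = [Order Date Key]
                                          THEN 1 ELSE 0 END AS SameDayFulfillment,
                                     [City Key] AS CityKey, [Customer Key] AS CustomerKey,
                                     [Stock Item Key] AS StockItemKey, Quantity
                              FROM Fact.[Order]'
        WITH RESULT SETS ((model VARBINARY(MAX)));
    END;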

  • Execute the following code to create a table that is used to store the model. The final section executes the earlier stored procedure and inserts the model into the new table.

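    For example (the table name here is illustrative):

    CREATE TABLE dbo.SampleRModels (model VARBINARY(MAX));

    -- train the model and store the serialized binary in the new table
    INSERT INTO dbo.SampleRModels (model)
    EXECUTE dbo.sp_EmbeddedRSample;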

  • You will now create a second stored procedure to predict same day fulfillments based on the earlier model. Note that the new dbo.sp_PredictRSample procedure takes a number of parameters that correspond to the SQL input you are already familiar with. The procedure uses the model you trained earlier, and the output includes the new data used to predict as well as a new column called PredictedSameDayFulfillment.

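    A sketch of the shape such a procedure can take; the parameter names mirror the hypothetical columns from Activity 1, and dbo.SampleRModels is the illustrative table above:

    CREATE PROCEDURE dbo.sp_PredictRSample
        @CityKey INT, @CustomerKey INT, @StockItemKey INT, @Quantity INT
    AS
    BEGIN
        DECLARE @model VARBINARY(MAX) =
            (SELECT TOP (1) model FROM dbo.SampleRModels);

        EXECUTE sp_execute_external_script
            @language = N'R',
            @script = N'
                # rebuild the model from its serialized form, then score one record
                model <- unserialize(rmodel)
                newdata <- data.frame(CityKey = ck, CustomerKey = cu,
                                      StockItemKey = si, Quantity = q)
                pred <- rxPredict(model, data = newdata)
                OutputDataSet <- cbind(newdata, PredictedSameDayFulfillment = pred[[1]])',
            @params = N'@rmodel VARBINARY(MAX), @ck INT, @cu INT, @si INT, @q INT',
            @rmodel = @model, @ck = @CityKey, @cu = @CustomerKey,
            @si = @StockItemKey, @q = @Quantity
        WITH RESULT SETS ((CityKey INT, CustomerKey INT, StockItemKey INT,
                           Quantity INT, PredictedSameDayFulfillment FLOAT));
    END;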

  • Run the following code, which executes the prediction stored procedure with one record of sample data. Note that the output in this case was ~0.67, which leans only weakly toward same day fulfillment because it is closer to 1 than 0.

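    For example (the key values are placeholders):

    EXECUTE dbo.sp_PredictRSample
        @CityKey = 1, @CustomerKey = 1, @StockItemKey = 1, @Quantity = 10;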


  • Change the quantity value and note the new prediction. In this case, increasing the quantity ordered to 1000 results in a predicted value much closer to 0. Based on the training data, high-quantity orders likely cannot be fulfilled the same day as easily. In the real world, this could be due to lack of inventory or other factors.


  • As a final step, you will create a new stored procedure as a lead-in for the next Reporting Services activity. Execute the following code to create the dbo.sp_HistogramSample procedure. Note that the output is binary, as it was when you embedded the R model earlier. The final output will be the same image you viewed in RStudio, but this R code is noticeably different from what you ran in RStudio. While you cannot see the image in Management Studio, the binary can be consumed in client tools such as Reporting Services.

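    A sketch of the pattern for returning a plot as binary output (rendering to a temporary jpeg file inside the R script):

    CREATE PROCEDURE dbo.sp_HistogramSample
    AS
    BEGIN
        EXECUTE sp_execute_external_script
            @language = N'R',
            @script = N'
                # render the histogram to a temporary jpeg file, then return its bytes
                image_file <- tempfile(fileext = ".jpeg")
                jpeg(filename = image_file)
                print(rxHistogram(~ Quantity, data = InputDataSet))
                dev.off()
                OutputDataSet <- data.frame(plot = readBin(image_file, what = "raw",
                                                           n = file.info(image_file)$size))',
            @input_data_1 = N'SELECT Quantity FROM Fact.[Order]'
        WITH RESULT SETS ((plot VARBINARY(MAX)));
    END;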


Activity 3 - Visualize an R Plot in Reporting Services

  • Download and unzip the R-Services-Sample-Reports.zip file.

  • Open the R-Services-Sample-Reports solution in the latest version of SQL Server Data Tools.


  • Open the SampleHistogram.rdl file from Solution Explorer.


  • Open WideWorldImportersDW under Data Sources in Report Data. Change the connection string to your own SQL Server instance.


  • Expand Datasets in Report Data and view DataSet1. Note how the dataset uses the sp_HistogramSample stored procedure as a source. You may need to use the dropdown menu to reselect the stored procedure under your own SQL connection.


  • Right-click on the empty image object on the report canvas, then select Image Properties. In the Properties window, note how the image is using a Database source, selecting the plot from DataSet1, and using the jpeg MIME type.


  • Click Preview to view the report, and you should see the same histogram that you viewed in RStudio.




Tutorial: R Server on Azure HDInsight

R Server on Azure HDInsight is a convenient way to use R in a distributed environment. In the following tutorial, you will create an HDInsight cluster that includes R Server using the Azure Portal. You will then install RStudio on the R Server edge node. Finally, you will use R to manage HDFS resources, change compute contexts, and build models under each context. General familiarity with HDInsight, HDFS, and R is assumed.

Objectives

  • Create an R Server on Azure HDInsight cluster
  • Install RStudio on the cluster
  • Run a sample R script exploring the ability to work with Hadoop commands, change compute contexts, and build models

Prerequisites

  • An Azure subscription
  • An SSH client such as PuTTY


Activity 1 - Create an R Server on HDInsight Cluster

  • In a browser, navigate to the Azure Portal at https://portal.azure.com. Sign in with the Microsoft account associated with your Azure subscription if prompted.

  • Once in the Azure Portal, click New in the upper left menu, select Data and Analytics, and then select HDInsight.


  • Enter a unique Cluster Name and click on Select Cluster Type. Select the R Server on Spark option and click Select.


  • Under Credentials, enter separate logins for the cluster admin and SSH user. Remember the SSH credentials since you will need these later to connect to your cluster. When ready, click Select.


  • Configure the rest of the cluster options as required. When ready, click Create at the bottom of the panel. It may take up to 40 minutes to create your cluster. While you wait, you will see a notification in the upper right corner showing progress.


Activity 2 - Install RStudio on the Cluster

  • Follow the instructions from Azure documentation to install RStudio on your R Server edge node. You will need to wait for your cluster to be ready before you can proceed with these steps. You will also need your SSH client installed to connect to the cluster.

  • After completing the steps from Azure documentation, be sure to leave RStudio open in your browser at localhost:8787.

 

Activity 3 - Use R to Manage Hadoop and Build Models under Different Compute Contexts

R Server on HDInsight offers processing under a variety of compute contexts such as Local, Local Parallel, and Spark. In this activity, you will explore changing between different contexts and building logistic regression models under each.


  • Enter the following command into your SSH client while it is connected to the R Server edge node. This downloads the income.csv file to your cluster’s storage. Optionally, you can also verify that income.csv downloaded correctly by typing ls into the console after having run the wget command.

    wget https://cdn.rawgit.com/BlueGranite/Microsoft-R-Resources/master/hdinsight-r-server-spark/income.csv

  • Open the tutorial's HDInsight.R script in RStudio at localhost:8787 (or copy and paste the code in this tutorial as needed). 

  • Run the following R code to reset the compute context to Local if needed. Note the rx prefix, which means that you are using a ScaleR function.

    rxSetComputeContext('local')


  • Run the following R code to check the Hadoop version.

    rxHadoopVersion()


  • Run the following R code in sequence to make a new directory in HDFS, copy the income.csv file (that you downloaded earlier using your SSH client) into the directory, and list the directory contents. Note that you use the rxHadoop[…] commands instead of standard shell commands such as mkdir and ls.

    rxHadoopMakeDir("/data/income/")
    rxHadoopCopyFromLocal("~/income.csv", "/data/income/")
    rxHadoopListFiles("/data", recursive = TRUE)


  • Run the following R code in sequence to set the file path, verify that you are reading from HDFS, build a text object from the CSV file, and then view the structure of the object.

    incomepath <- file.path("/data/income/income.csv")
    hdfsFS <- RxHdfsFileSystem()
    incomedata <- RxTextData(incomepath, fileSystem = hdfsFS, stringsAsFactors = TRUE)
    rxGetVarInfo(incomedata)


  • Run the following R code in sequence to build a logistic regression model under the Local context.

    incomeformula <- "IncBin ~ age + education + race + sex"
    model <- rxLogit(formula = incomeformula, data = incomedata)
    summary(model)


  • Run the following R code in sequence to switch to Local Parallel and build a logistic regression model in the new context.

    rxSetComputeContext('localpar')
    model <- rxLogit(formula = incomeformula, data = incomedata)
    summary(model)


  • Run the following R code in sequence to switch to Spark and build a logistic regression model in the new context.

    spark <- RxSpark()
    rxSetComputeContext(spark)
    model <- rxLogit(formula = incomeformula, data = incomedata)
    summary(model)


 

 

Tutorial: Power BI and R

There are two different ways to apply R in Power BI: the R Script for loading and transforming data, and the R Visual for additional enhancement and data visualization. The following tutorial walks through a basic scenario using both.

Objectives

  • Configure R in Power BI
  • Use the R Script for loading data
  • Create an R Visual

Prerequisites

  • Download and install Power BI Desktop.
  • Download and install R on your local workstation.

Activity 1 - Configure R in Power BI

  • The first time you use R in Power BI, you will need to locate your local R installation. This is a one-time setup unless you upgrade R in the future.
  • Open Power BI Desktop. Close the startup window with Get Data, recent files, etc. if it appears.
  • Go to File, select Options and settings, and then select Options.
  • Under the Global heading, select R Scripting. Power BI attempts to automatically detect a local R installation. If you have multiple versions of R installed, you can select an alternate version from the menu.



  • Click OK when done.

 

Activity 2 - Load Data with R Script

  • From the main report canvas, select Get Data on the Home tab, then select More….


  • Choose R Script from the list of data sources and then click Connect.


  • Copy and paste the following code into the Execute R Script window, then click OK. This code simply references a remote CSV file and loads it into a data frame without transforming it in any way. In practice, to take full advantage of the R Script, you would do more advanced data cleansing and transformation with R before the data loads.
    source <- "https://cdn.rawgit.com/BlueGranite/Microsoft-R-Resources/master/power-bi/gameconsole.csv"  
    game.console <- read.csv(source, header = TRUE)


  • On the Navigator window, check the box next to game.console and then click Load. You should see “game console” appear in the “Fields” panel with three fields: Console, Date, and Search.



 

Activity 3 - Plot a Time Series Forecast with R Visual

  • Click the R visual in the Visualizations panel. You may receive a message stating that you need to Enable Script Visuals; if so, select Enable. An R script editor appears on the report canvas, with a message stating that you need to drag fields into the Values area.


  • Drag Console, Date, and Search from the Fields panel into Values. Note that the R script editor changes to show these fields and adds them to a data frame.


  • Copy and paste the following code into the R script editor, then click Run. (Note that you may need to install the forecast package in your local R environment first using install.packages("forecast").) After a few seconds, you should see a plot display on the report.

    library(forecast)
    
    # group searches by date
    dataset <- aggregate(Search ~ Date, data = dataset, FUN = sum)
    
    # create a time series with four observations per seasonal cycle
    ts <- ts(dataset$Search, frequency = 4)
    
    # pull out the seasonal, trend, and irregular components from the time series (train the forecast model)
    decom <- stl(ts, s.window = "periodic")
    
    # forecast the next 12 periods of searches
    pred <- forecast(decom, h = 12)
    
    # plot the predicted searches
    plot(pred)

 


  • Add a second R visual to the report canvas and drag Console, Date, and Search from the Fields panel into Values.
  • Copy and paste the following code into the new R script editor, then click Run. This script also requires the forecast package. After a few seconds, you should see a plot display on the report.
    library(forecast)
    
    # group searches by date
    dataset <- aggregate(Search ~ Date, data = dataset, FUN = sum)
    
    # create a time series with four observations per seasonal cycle
    ts <- ts(dataset$Search, frequency = 4)
    
    # pull out the seasonal, trend, and irregular components from the time series (train the forecast model)
    decom <- stl(ts, s.window = "periodic")
    
    # plot the seasonal, trend, and irregular components
    plot(decom)

 



  • Add a Slicer visualization to the report canvas and add Console to the slicer’s Values. Check the box next to each of the three gaming consoles to see how the forecast changes for each.

 

 

 

Additional Resources

CRAN (Open-source R) - https://cran.r-project.org/

Microsoft R Portal - https://mran.revolutionanalytics.com/

Getting Started with R - https://mran.revolutionanalytics.com/documents/getting-started/

Microsoft R Strategy - http://www.zdnet.com/article/microsofts-r-strategy/

R Server Home Page - https://www.microsoft.com/en-us/server-cloud/products/r-server/

SQL Server 2016 E-Book - https://blogs.msdn.microsoft.com/microsoft_press/2015/12/22/free-ebook-introducing-microsoft-sql-server-2016-mission-critical-applications-deeper-insights-hyperscale-cloud-preview-edition/

Azure Data Science Virtual Machine - https://azure.microsoft.com/en-us/marketplace/partners/microsoft-ads/standard-data-science-vm/

Azure ML FAQ - https://azure.microsoft.com/en-us/documentation/articles/machine-learning-faq/?cdn=disable

Azure ML Templates with SQL Server R Services - https://gallery.cortanaintelligence.com/Collection/ML-Templates-with-SQL-Server-R-Services-1

Azure ML Fraud Detection Template - https://gallery.cortanaintelligence.com/Collection/Online-Fraud-Detection-Template-1

R Server for HDInsight - https://azure.microsoft.com/en-us/services/hdinsight/r-server/

Create Power BI Visuals Using R - https://powerbi.microsoft.com/en-us/documentation/powerbi-desktop-r-visuals/

Connected Cows presentation - https://www.youtube.com/watch?v=oY0mxwySaSo

 

 

Related BlueGranite Blog Posts

Advanced Analytics

http://www.blue-granite.com/blog/built-in-analytics-why-r-matters-to-a-sql-server-professional

http://www.blue-granite.com/blog/hiring-retention-with-predictive-analytics

http://www.blue-granite.com/blog/hands-on-labs-azure-ml

http://www.blue-granite.com/blog/top-6-use-cases-to-help-you-understand-big-data-analytics

Data Lakes / Hadoop

http://www.blue-granite.com/blog/5-reasons-to-get-excited-about-sql-server-2016-and-big-data

http://www.blue-granite.com/blog/apache-solr-3-analytic-use-cases

http://www.blue-granite.com/blog/make-big-data-easy-with-microsoft-azure-data-lake

http://www.blue-granite.com/blog/demo-day-exploratory-analysis-spark-hdinsight

Power BI / Azure

http://www.blue-granite.com/blog/tour-of-the-power-bi-ecosystem

http://www.blue-granite.com/blog/new-microsoft-power-bi-governance-and-deployment-whitepaper

http://www.blue-granite.com/blog/3-pointers-to-make-your-microsoft-power-bi-report-a-winner

http://www.blue-granite.com/blog/improve-the-effectiveness-of-your-bi-and-analytics-programs-with-a-data-catalog

http://www.blue-granite.com/blog/video-overview-of-microsoft-bi-reporting-tool-integration

http://www.blue-granite.com/blog/automate-pulling-in-life-changing-external-data-with-azure-logic-apps

SQL Server 2016

http://www.blue-granite.com/blog/top-3-reasons-to-upgrade-to-sql-server-2016-reporting-services

http://www.blue-granite.com/blog/3-reasons-to-upgrade-to-sql-server-analysis-services-2016

http://www.blue-granite.com/blog/top-3-reasons-to-upgrade-your-analytics-environment-to-sql-server-2016

Questions about Microsoft R?