Resources for Microsoft R: A Revolution in Advanced Analytics
R statistical computing has grown from a research and prototyping tool into a key component of commercial analytics platforms. The greatest evidence of this may be in Microsoft’s acquisition and subsequent embedding of Revolution R in its key data and analytics technologies.
Whether used locally or in the cloud, Microsoft R is a powerful tool for data discovery and predictive analytics. And with Microsoft's ‘Write Once, Deploy Anywhere’ approach, R code can be written once and deployed in a wide range of data management platforms, enterprise data warehouses, servers, and workstations without re-developing in another production platform.
Join us in this session as the BlueGranite advanced analytics team reviews the benefits of R, especially in the Microsoft analytics ecosystem.
An introduction to the R platform – including a getting started guide
An overview of the major components of the Microsoft R ecosystem – including features of parallel processing on big data, and using R in environments like SQL Server 2016, PowerBI, and Hadoop
A demonstration of R in action in multiple platforms – from prototyping to production
R resources and tips on executing a successful R project
The business value of Microsoft’s ‘Write Once, Deploy Anywhere’ approach with R - key business issues around the ability to operationalize analytics and avoid re-engineering costs
View the webinar recording.
This tutorial walks through some basic development and deployment scenarios for R Services in SQL Server 2016. You will use sample order data and take R code that can be run standalone in RStudio or another R IDE, adapt it for use in SQL Server, and then consume it both in SQL Management Studio as well as from Reporting Services.
In RStudio, open the RevoScaleR-vs-Traditional.R file.
Install the RODBC package (i.e. install.packages(“RODBC”) ) if needed. RevoScaleR should already be available if you have Microsoft R Client installed.
Run the setup code in sequence to reference the RODBC and RevoScaleR libraries and reset your compute context to “local” if you have already been using a SQL compute context outside of this tutorial.
Replace [SERVERNAME] in sql.conn.string with your SQL Server instance. You may need to change the trusted connection to Uid= and Pwd= depending on your authentication type. Once your connection string is ready, run the code to add sql.conn.string and source.query to your R environment. The source query is a very simple query running against the WWI Fact.Order table. For simplicity, it does not join to any dimensions to get descriptions.
You will now look at one traditional method to analyze data from SQL Server. You will create an ODBC connection and import data from SQL Server into a local data frame named orders. Run the following code to load data and get the first few observations.
Look at the first few observations. SameDayFulfillment is an engineered feature with a 0/1 value noting if an order was packed the same day that it was placed. If 1, it was same day fulfillment. If 0, it was not. The other variables are unmodified key values from the Fact.Order table. You should also notice the orders data frame in your environment window with 29913 observations and 5 variables.
You will now run a very basic logistic regression using the built-in glm function on the local data frame and view the coefficients. Fitting an accurate model is not the goal of this tutorial, but in the real world, you would go through the appropriate training and testing steps.
Now that you’ve seen a quick overview of the traditional method, you will switch your compute context and process data in-database instead of locally. You will also change from traditional functions to the ScaleR “rx”-prefixed functions from the RevoScaleR package. Run the following code to change your compute context from local to SQL Server.
Run the following code to build your SQL query.
The following code using rxImport() is here for information more than function. If you wanted to bring your data into a local data frame, you can use rxImport with the in-database object.
Otherwise, run the following code to train a logistic regression model using the SQL data object. Note that you now use the rxLogit() function (equivalent to glm above). You could take it a step further on your own and use the rxPredict() function with new data to classify whether or not additional orders would be fulfilled same day or not.
Finally, run the following code to produce a histogram displaying the quantity of items per order. Note the positive/right skew where most orders have fifty items or less and the rest of the bins compose a long tail to the right.
Don’t worry about analyzing this data. The logistic regression and histogram are simple examples that will be used to show how to embed and work with your R code in SQL Server. The plot will surface again in the later Reporting Services activity.
In Management Studio, open the R-Services-Example.sql file
Connect to your SQL instance that has the WWI DW sample database and R Services installed.
Execute the following code to verify that you are using the correct database.
You will now take some time to explore a very basic example of the new sp_execute_external_script stored procedure, which allows you to work with R code using R Services. The required parameters are @language, @script, and @input_dataset_1. There are additional optional parameters that are not currently being used. @script is where the adapted R code goes. Note that except for changing the source data to InputDataSet, the code itself is similar to the logistic regression you ran earlier in RStudio. The final line of @script is there simply to print to stdout. Likewise, note that @input_dataset_1 is similar to the source SQL query that you used in RStudio. Execute the code and note that the output on the Messages tab in Management Studio. Does it appear similar to the rxLogit() output in RStudio?
The prior execution output to stdout, but now you will wrap that code in another stored procedure for repeatable use. In the following code, note the changes. You are now creating a stored procedure called dbo.sp_EmbeddedRSample. Instead of printing the output, you are serializing the trained model, converting it to a data frame, and storing it in OutputDataSet. Finally, note the addition of WITH RESULT SETS at the end and how a binary model is the new output. Execute the code to create the new stored procedure.
Execute the following code to create a table that is used to store the model. The final section executes the earlier stored procedure and inserts the model into the new table.
You will now create a second stored procedure to predict same day fulfillments based on the earlier model. Note that the new dbo.sp_PredictRSample procedure takes a number of parameters that correspond to the SQL input you have already been familiar with. The procedure uses the model you trained earlier, and the output includes the new data used to predict as well as a new column called PredictedSameDayFulfillment.
Run the following code, which executes the prediction stored procedure with one record of sample data. Note that the output in this case was ~0.67, which unconvincingly leans toward same day fulfillment because it is closer to 1 than 0.
Change the quantity value and note the new prediction. In this case, increasing the quantity ordered to 1000 results in a predicted value much closer to 0. Based on the established training data, high-quantity orders likely cannot be fulfilled same day as easily. In the real world, this could be due to lack of inventory or other factors.
As a final step, you will create a new stored procedure as a lead-in for the next Reporting Services activity. Execute the following code to create the dbo.sp_HistogramSample procedure. Note that the output is binary as it was when you embedded the R model earlier. While the final output will be the same image as what you viewed in RStudio, this R code is noticeably different than what you ran in RStudio. While you cannot see the image in Management Studio, the binary can be consumed in client tools such as Reporting Services.
Download and unzip the R-Services-Sample-Reports.zip file
Open the R-Services-Sample-Reports solution in the latest version of SQL Server Data Tools.
Open the SampleHistogram.rdl file from Solution Explorer.
Open WideWorldImportersDW under Data Sources in Report Data. Change the connection string to your own SQL Server instance.
Expand Datasets in Report Data and view Dataset1. Note how the dataset uses the sp_HistogramSample stored procedure as a source. You may need to use the dropdown menu and select the stored procedure using your own SQL connection.
Right-click on the empty image object on the report canvas, then select Image Properties. In the Properties window, note how the image is using a Database source, selecting the plot from DataSet1, and using the jpeg MIME type.
Click Preview to view the report, and you should see the same histogram that you viewed in RStudio.
R Server on Azure HDInsight is a convenient way to utilize R in a distributed environment. In the following tutorial, you will create an HDInsight cluster that includes R Server using the Azure Portal. You will then install RStudio on the R Server edge node. Finally, you will use R to manage HDFS resources, change compute contexts, and build models under each context. General familiarity with HDInsight, HDFS, and R is assumed.
In a browser, navigate to the Azure Portal at https://portal.azure.com. Sign in with the Microsoft account associated with your Azure subscription if prompted.
Once in the Azure Portal, click New in the upper left menu, select Data and Analytics, and then select HDInsight.
Enter a unique Cluster Name and click on Select Cluster Type. Select the R Server on Spark option and click Select.
Under Credentials, enter separate logins for the cluster admin and SSH user. Remember the SSH credentials since you will need these later to connect to your cluster. When ready, click Select.
Configure the rest of the cluster options as required. When ready, click Create at the bottom of the panel. It may take up to 40 minutes to create your cluster. While you wait, you will see a notification in the upper right corner showing progress.
Follow the instructions from Azure documentation to install R Studio on your R Server edge node. You will need to wait for your cluster to be ready before you can proceed with these steps. You will also need your SSH client installed to connect to the cluster.
After completing the steps from Azure documentation, be sure to leave R Studio open in your browser at localhost:8787.
R Server on HDInsight offers processing under a variety of compute contexts such as Local, Local Parallel, and Spark. In this activity, you will explore changing between different contexts and building logistic regression models under each.
Enter the following command into your SSH client while it is connected to the R Server edge node. This downloads the income.csv file to your cluster’s storage. Optionally, you can also verify that income.csv downloaded correctly by typing ls into the console after having run the wget command.
Open the tutorial's HDInsight.R script in RStudio at localhost:8787 (or copy and paste the code in this tutorial as needed).
Run the following R code to reset the compute context to Local if needed. Note the rx prefix, which means that you are using a ScaleR function.
Run the following R code to check the Hadoop version.
Run the following R code in sequence to make a new directory in HDFS, copy the income.csv file (that you downloaded earlier using your SSH client) into the directory, and list the directory contents. Note that you used the rxHadoop[…] commands instead of standard shell commands such as mkdir and ls.
rxHadoopMakeDir("/data/income/") rxHadoopCopyFromLocal("~/income.csv", "/data/income/") rxHadoopListFiles("/data", recursive = TRUE)
Run the following R code in sequence to set the file path, verify that you are reading from HDFS, build a text object from the CSV file, and then view the structure of the object.
incomepath <- file.path("/data/income/income.csv") hdfsFS <-RxHdfsFileSystem() incomedata <- RxTextData(incomepath, fileSystem = hdfsFS, stringsAsFactors = TRUE) rxGetVarInfo(incomedata)
Run the following R code in sequence to build a logistic regression model under the Local context.
incomeformula <- "IncBin ~ age + education + race + sex" model <- rxLogit(formula = incomeformula, data = incomedata) summary(model)
Run the following R code in sequence to switch to Local Parallel and build a logistic regression model in the new context.
rxSetComputeContext('localpar') model <- rxLogit(formula = incomeformula, data = incomedata) summary(model)
Run the following R code in sequence to switch to Spark and build a logistic regression model in the new context.
spark <-RxSpark() rxSetComputeContext(spark) model <- rxLogit(formula = incomeformula, data = incomedata) summary(model)
There are two different ways to apply R in Power BI: the R Script for loading and transforming data, and the R Visual for additional enhancement and data visualization. The following tutorial walks through a basic scenario using both.
source <- "https://cdn.rawgit.com/BlueGranite/Microsoft-R-Resources/master/power-bi/gameconsole.csv" game.console <- read.csv(source, header = TRUE)
library(forecast) # group searches by date dataset <- aggregate(Search ~ Date, data = dataset, FUN= sum) # create a time series based on month ts <- ts(dataset$Search, frequency=4) # pull out the seasonal, trend, and irregular components from the time series (train the forecast model) decom <- stl(ts, s.window = "periodic") # predict the next 3 months of searches pred <- forecast(decom, h = 12) # plot the predicted searches plot(pred)
library(forecast) # group searches by date dataset <- aggregate(Search ~ Date, data = dataset, FUN= sum) # create a time series based on month ts <- ts(dataset$Search, frequency=4) # pull out the seasonal, trend, and irregular components from the time series (train the forecast model) decom <- stl(ts, s.window = "periodic") plot(decom)
CRAN (Open-source R) - https://cran.r-project.org/
Data Lakes / Hadoop
Power BI / Azure
SQL Server 2016