
Aug 18, 2015

Demo Day: Simplify Analysis of Big Data with Spark on Azure HDInsight

Posted by David Eldersveld

Some of the key tasks in data science involve basic exploration of new or existing data.  You give raw data structure, join it to other datasets, select features for later analysis, and much more.  Depending on the questions you want to answer and your other requirements, the process repeats until the data is ready for further, more advanced analytics.


Apache Spark on Azure HDInsight makes these core tasks simpler by including both Apache Zeppelin and Jupyter notebooks.  In this Demo Day video, I walk through basic exploration of a city's traffic crash history using Zeppelin with both Spark DataFrames and Spark SQL.  I discuss some of the advantages of using Zeppelin and Spark for data of any volume.  Working with a new text file, I take an initial look at which features are available, see what cleansing may be needed, and get a basic feel for the dataset through querying and visualization.  At this stage, I compute summary statistics and develop a repeatable process that can be used later.  This is descriptive analysis, but it also raises a question: how can the data be prepared for other applications such as predictive analytics?
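As a rough illustration of that first pass, here is a minimal PySpark sketch.  It is not the code from the video: the column names `road_type` and `weather`, the function name `explore_crashes`, and the file path you pass in are all hypothetical.  It also assumes the notebook supplies a `SparkSession` named `spark`, which is the newer API; a 2015-era notebook would more likely have worked through a `sqlContext`.

```python
# Hypothetical column names; replace them with the ones in your own file.
CRASHES_BY_WEATHER_SQL = """
    SELECT weather, COUNT(*) AS crashes
    FROM crashes
    GROUP BY weather
    ORDER BY crashes DESC
"""


def explore_crashes(spark, path):
    """First-pass look at a delimited text file of crash records.

    `spark` is assumed to be the SparkSession provided by the notebook.
    """
    # Read the raw text file, using the header row and inferring column types.
    df = spark.read.csv(path, header=True, inferSchema=True)

    # See which features are available and which types were inferred.
    df.printSchema()

    # Summary statistics (count, mean, stddev, min, max) for the columns.
    df.describe().show()

    # DataFrame API: crash counts per road type, largest first.
    df.groupBy("road_type").count().orderBy("count", ascending=False).show()

    # Spark SQL: register the DataFrame as a view and query it by name.
    df.createOrReplaceTempView("crashes")
    by_weather = spark.sql(CRASHES_BY_WEATHER_SQL)
    by_weather.show()
    return by_weather
```

Wrapping the steps in a function is one way to get the repeatable process mentioned above: the same exploration can be rerun against next month's file by changing only the path.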

Overall, the data brings me closer to answering my initial questions and prompts new ones.  For example:

  • Weather impacts road conditions.  During a snowstorm, am I usually safer taking a two-lane road or a freeway?  Freeways may have more accidents overall, but they also carry a much higher traffic volume.  Factoring in a road's average daily traffic, do accidents during snow increase at similar rates for all road types, or do they increase at all?
  • College football home games increase traffic congestion.  Is there an increase in accidents that correlates with that congestion?  Do accidents on game days take place along main corridors to the stadium, or are they dispersed throughout the city?
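The first question depends on normalizing crash counts by traffic volume rather than comparing raw totals.  The sketch below shows one way to do that arithmetic in plain Python.  Every number in it is made up for illustration, the function and field names are hypothetical, and it assumes traffic volume on snow days matches the annual average, which is probably not true in practice.

```python
from collections import Counter


def crashes_per_million_vehicles(crash_count, avg_daily_traffic, days):
    """Crashes per one million vehicle passages over the given number of days."""
    exposure = avg_daily_traffic * days  # total vehicles that used the road
    return crash_count / exposure * 1_000_000


def snow_rate_ratio(records, avg_daily_traffic, snow_days, clear_days):
    """For each road type, divide the snow-day crash rate by the clear-day rate.

    `records` is an iterable of (road_type, condition) tuples, where
    condition is "snow" or "clear".  A ratio above 1 means crashes are
    more frequent per vehicle when it snows.
    """
    counts = Counter(records)
    ratios = {}
    for road_type, adt in avg_daily_traffic.items():
        snow = crashes_per_million_vehicles(
            counts[(road_type, "snow")], adt, snow_days)
        clear = crashes_per_million_vehicles(
            counts[(road_type, "clear")], adt, clear_days)
        ratios[road_type] = snow / clear if clear else float("nan")
    return ratios


# Made-up example data, not taken from the crash dataset in the video.
records = (
    [("freeway", "snow")] * 12 + [("freeway", "clear")] * 90
    + [("two_lane", "snow")] * 3 + [("two_lane", "clear")] * 15
)
avg_daily_traffic = {"freeway": 60_000, "two_lane": 5_000}

ratios = snow_rate_ratio(records, avg_daily_traffic, snow_days=20, clear_days=300)
for road_type, ratio in ratios.items():
    print(f"{road_type}: snow-day crash rate is {ratio:.1f}x the clear-day rate")
```

In this invented data the freeway has far more crashes in total, yet its per-vehicle rate rises less in snow than the two-lane road's does.  That is exactly the kind of difference raw counts would hide.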

View the video below to see how the Zeppelin notebook on a Spark on Azure HDInsight cluster can help me get answers.

Want to learn more about how data science in Azure can help your business?  Contact us for a consultation.


About The Author

David Eldersveld

For over ten years, David has employed skills in technology development, decision science, data engineering and analysis, systems analysis, and project management. David's work is almost exclusively on the Microsoft data and analytics platform, building BI and advanced analytics solutions on Azure Machine Learning, Microsoft R, and Power BI. He is active in the Microsoft community, speaking at PASS events and SQL Saturdays around the U.S.