Mar 26, 2019

Databricks Dashboards: Data Exploration With Salary Classification

Posted by Megan Quinn

For many companies, the initial attraction to Azure Databricks is the platform’s ability to process big data in a fast, secure, and collaborative environment. However, another highly advantageous feature is the Databricks dashboard. Dashboards are created directly through an existing Databricks notebook via a single click. They are essentially a presentation-friendly view of a Databricks notebook.

One of the more challenging tasks for data scientists and engineers is explaining the function and results of their code to key stakeholders in a way that is both interesting and intelligible. While clients with programming experience may enjoy delving into lines of code, clients who focus more on marketing may care more about how the results are presented. A common workaround is to piece together text, code, and results in a PowerPoint presentation, a time-consuming process that may still fail to give an accurate overview of the entire analysis. This post highlights some of the dashboard’s useful features that help resolve these issues, using the example of classifying salary brackets. For this example, the data comes from census information about individuals, along with their annual income. The goal is to predict whether an individual earns at most $50,000 or more than $50,000.

Data Exploration

The first step in any type of analysis is to understand the dataset itself. A Databricks dashboard can provide a concise format in which to present relevant information about the data to clients, as well as a quick reference for analysts when returning to a project.

To create this dashboard, a user can simply switch to Dashboard view instead of Code view under the View tab. The user can either click on an existing dashboard or create a new one. Creating a new dashboard will automatically display any of the visualizations present in the notebook. Customization of the dashboard is easily achieved by clicking on the chart icon in the top right corner of the desired command cells to add new elements.

[Image: dashboard creation]

[Image: exploring the data]

In this example, the Exploring the Data dashboard provides a general description of the dataset and highlights interesting trends. From this dashboard we can observe that the age distribution is right-skewed and that more individuals fall in the $50,000-or-less bracket than in the over-$50,000 bracket. We can also see a possible key relationship between income and gender. From the pie charts, males seem more likely than females to fall in the over-$50,000 bracket. To test whether this difference is statistically significant, the dashboard displays a breakdown of the descriptive statistics associated with each gender, as well as the results of a two-sample t-test.
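As a rough illustration of the two-sample t-test mentioned above, here is a minimal sketch in plain Python. It assumes Welch's unequal-variance form of the test and uses made-up 0/1 indicators (1 = earns more than $50,000); the post does not say which variant of the test or which variable the dashboard actually used, so the function name `welch_t` and the sample values are purely illustrative.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's two-sample t statistic and its approximate degrees of freedom."""
    # Per-group sample variance divided by group size (squared standard errors)
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical samples (NOT taken from the census data):
# 1 = earns more than $50,000, 0 = earns $50,000 or less
male = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
female = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

t_stat, dof = welch_t(male, female)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.1f}")
```

Turning the statistic into a p-value requires the t distribution's CDF, which the standard library does not provide. In a notebook, a library routine such as `scipy.stats.ttest_ind(male, female, equal_var=False)` would typically be used instead of this hand-rolled version.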

Creating this type of initial data exploration dashboard can serve as an excellent way for data scientists to organize their thoughts about potential influential factors to consider during analysis, as well as highlight to clients possibly undiscovered trends in their data.

[Image: salary class]

Another valuable feature of the Databricks dashboard is the ability to easily show the code associated with a certain visualization. For instance, if a client is interested in the generation of the Age Distribution graph, clicking on the Go to command icon appearing in the top right corner will automatically switch to the notebook view in the exact location of the command. This tool provides a convenient method to demonstrate both results and code without aimlessly scrolling through several lines or copying and pasting only small, dispersed snippets. It also provides data scientists the convenience of easily locating specific code for editing purposes.

[Image: salary class 2]

[Image: salary class 3]

Data Modeling

Dashboards can also be created to present the method and results of the data model. For clients, this makes complex analyses more tangible. Data scientists benefit from having all the key results organized in one place, which is especially useful if additional analysis will be performed at a later date.

In the Data Analysis dashboard below, we can see that logistic regression was applied to the salary classification problem, as well as the breakdown of the training and testing datasets. The Feature Selection graph illustrates that only 13 of the original 14 features are considered in the model. These features are then listed on the right side for easy reference.

The next step in creating a model is finding the optimal parameter values. For logistic regression, the key parameter is C, a penalty (regularization) parameter that helps reduce overfitting. The Optimal C Parameter chart compares the area under the curve (AUC) and accuracy scores for C values between 1 and 3. For both score types, C = 2.75 produced the highest value. Using this parameter, the model is then evaluated with 10-fold cross-validation. The K-fold box in the dashboard shows that the model achieves both high accuracy and a high area-under-the-curve score. The Confusion Matrix table breaks down the model’s hits and misses.
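The evaluation quantities named above (accuracy, area under the curve, the confusion matrix, and k-fold splits) can be sketched in plain Python as follows. This is only an illustration of how the scores are defined, not the notebook's actual code: the function names, the 0.5 decision threshold, and the labels and scores are all hypothetical, and no model is fitted here.

```python
def confusion_matrix(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    return (tp + tn) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve: the chance that a random positive
    outscores a random negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for k contiguous folds over range(n)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# Hypothetical labels (1 = over $50,000) and model scores, for illustration only
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("confusion (tn, fp, fn, tp):", confusion_matrix(y_true, y_pred))
print("accuracy:", accuracy(y_true, y_pred))
print("AUC:", auc(y_true, scores))
for train_idx, test_idx in k_fold_indices(len(y_true), 4):
    print("held-out fold:", test_idx)
```

In a parameter sweep like the one charted on the dashboard, these scores would be computed on each held-out fold for every candidate C and then averaged, and the C with the best average would be kept.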

Finally, the resulting model is easily shared on the dashboard by listing the variables’ coefficients and their associated p-values. Markdown command cells can also be added to the dashboard to create an overall summary of the findings, as in the Conclusion box below.

[Image: logistic regression]

Once a dashboard is completed, it can be shared via a URL by clicking on the Present Dashboard tab in the left side pane. From this URL, users can examine specific data points by hovering over the charts, reorder data in tables, and even update the dashboard to reflect the most recent code in the notebook. This combination of user interaction and a live data connection makes for presentations that are more engaging and more useful to clients than static PowerPoints or Word documents.

[Image: data analysis]

Overall, Azure Databricks gives data scientists the ability to both analyze and present big data with ease. The dashboard’s ability to interactively display results, text, and code in a succinct, easily digestible manner not only saves data scientists time and resources but also provides clients with an interesting view of the data and process. Viewing an analysis in this manner fosters greater client understanding and acceptance of all the steps required in machine learning, including data exploration, modeling, tuning, and model evaluation. Even the most accurate and thorough model can be rejected by stakeholders if its presentation fails to reflect its value, which is why the dashboard is such a crucial complement to the overall processing ability of Databricks.

Contact BlueGranite today to learn more about how your organization can capitalize upon the data-driven insights of the dashboards available with Azure Databricks.


About The Author

Megan Quinn

Megan has expertise in statistical analysis and machine learning, as well as statistical theory. Her recent work has focused on predictive maintenance for military fleets, and she also has a background in education research. She is knowledgeable in a variety of analytical tools, including Python, R, SQL, and, most recently, Spark and Databricks.
