Nov 09, 2017

Tips for Implementing Dev/Test Environments and DevOps with Azure Data Lake Projects

Posted by Josh Fennessy

“Deploying data platform solutions to Microsoft Azure is easy,” they say. “Deploy Hadoop in 5 clicks or less!” or “Create a Data Lake in under 3 minutes!” While such claims are powerful and show the flexibility and simplicity of managing infrastructure within Azure, they don't really speak to deploying the complex, production-ready solutions that many large organizations are looking to implement.

Solid solutions looking to enter modern production environments need to support two things up front:

  • DEV and/or TEST sandboxes: Utilizing a non-production environment to test, make changes, and implement new features is an essential function. A sandbox ensures a safe and proper testing solution without impacting the day-to-day production environment.
  • Deployment Automation: Software deployments need to be automated as well as safe and secure. Automating the deployment of your applications across environments improves the productivity of both the Dev and Ops teams. In addition, deployment automation ensures every build deploys the same way, reduces human error, and allows more frequent updates that address customer needs sooner, keeping organizations competitive.

On DevOps: “I know all about DEV/TEST environments surrounding DevOps. Our organization includes them for all IT deployments, including our data warehouse.” And that’s a fantastic start: managing multiple environments and deployments in an on-premises environment is important. However, with the proliferation of cloud-based and platform-as-a-service (PaaS) products, there are a huge number of new options for building solutions. Managing code and environments in those solutions can be substantially different from what you may be used to on-premises.

Here are a few tips to consider while creating new Azure Data Platform projects. 

1) Create a subscription for Dev/Test

  • When you create an Azure tenant, you’re automatically given an Azure subscription to use within that tenant. But you can create additional subscriptions to separate resources and security models.
  • Azure Dev/Test subscriptions are a powerful tool to help you create and test your projects, while controlling costs. Pricing within Azure Dev/Test is reduced for many of the common services that we use in data platform projects. 

2) Use a Cloud-based Source Control system

  • Even with platform-as-a-service implementations, someone still must write code to build and implement the solution. Often, enterprises have an in-house method for storing, versioning, and managing source code. For Azure Data Platform projects, this system should live in the cloud with the solution. Microsoft Visual Studio Team Foundation Server (TFS) is a great place to store your Data Factory JSON files, HDInsight Hive and Spark projects, and even your PowerShell automation scripts and ARM templates.
  • In addition to source control, TFS can automate the build, test, and deploy process. If your Data Platform includes custom applications that connect to data APIs or serve user interfaces for data entry, you can configure TFS for Continuous Deployment to ensure timely, stable builds and deployments to your Azure-based infrastructure.

3) Use Visual Studio for development

  • While most of the Azure Data Platform services offer web-based development environments, a better choice for enterprise development is to use Visual Studio. Not only does Visual Studio integrate directly with TFS for easy management of source control (which is a great tie-in with the previous tip), Visual Studio also offers a set of features that enable the management of Dev/Test environments.
  • Azure Data Factory projects, for example, support multiple build environment options. This matters because it lets you deploy new pipelines from Visual Studio to your DEV/TEST environment without hard coding any values. Removing hard-coded settings is important for stable deployment practices.
  • Azure HDInsight also has a rich set of project templates that can be used to expedite development and deployment of data platform projects.
  • Learn how to manage multiple Azure environments within Visual Studio with this link.
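The idea of keeping environment-specific values out of the pipeline definition can be sketched in a few lines of Python. This is only a generic illustration of the pattern, not the mechanism Visual Studio or Data Factory actually uses: the environment names, the setting keys, and the simplified pipeline structure below are all hypothetical.

```python
from string import Template

# Hypothetical per-environment settings; every name and value is illustrative.
ENVIRONMENTS = {
    "dev": {"storage_account": "examplelakedev", "root_folder": "/dev/sales"},
    "prod": {"storage_account": "examplelakeprod", "root_folder": "/prod/sales"},
}

# Simplified, made-up pipeline definition (not the real Data Factory schema).
# Values that differ between environments are written as $placeholders.
PIPELINE = {
    "name": "CopySalesData",
    "properties": {
        "source": "$storage_account",
        "folders": ["$root_folder/raw", "$root_folder/curated"],
    },
}


def render(node, settings):
    """Recursively substitute $placeholders in every string of a JSON-like structure."""
    if isinstance(node, dict):
        return {key: render(value, settings) for key, value in node.items()}
    if isinstance(node, list):
        return [render(item, settings) for item in node]
    if isinstance(node, str):
        # substitute() raises KeyError when a placeholder has no value,
        # so a missing setting fails loudly instead of deploying a bad config.
        return Template(node).substitute(settings)
    return node


dev_pipeline = render(PIPELINE, ENVIRONMENTS["dev"])
prod_pipeline = render(PIPELINE, ENVIRONMENTS["prod"])
print(dev_pipeline["properties"]["folders"])
print(prod_pipeline["properties"]["folders"])
```

The same definition is checked in to source control once, and only the small settings dictionary changes per environment.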

4) Only use Dev/Test resources when needed

  • One of the most important reasons to move to the cloud is to save money. The main source of cost savings in the cloud is that we only have to pay for infrastructure while we are using it. Usually, organizations are not developing solutions 24 hours a day. Testing doesn't happen continuously either. It often happens only at the end of a sprint and may be limited to a few days. Using Azure Automation (PowerShell) and Azure Resource Manager (ARM) templates, you have complete control over which features and products are running within your Azure subscription at any given time.
    • For example, an HDInsight cluster used for development might cost $15 an hour to run. You have to pay this charge even when the cluster is idle. If you know that development does not occur after 6 p.m. and doesn't start up again until at least 9 a.m. the next day, you can create an ARM template and Automation script to deactivate the cluster in the evening and recreate it in the morning. When the cluster is deactivated (deleted), you don't have to pay for it. Managing cost by only running infrastructure when you need it is something that only the cloud can give you.
  • Learn how to manage Azure with Automation and ARM templates with this link.
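To make the arithmetic in the HDInsight example concrete, here is a minimal Python sketch. It assumes the cluster exists only between 9 a.m. and 6 p.m. and ignores storage charges and provisioning time, which the example above does not cover. The function names are hypothetical, and the actual start and stop actions would live in your Automation script rather than in this code.

```python
from datetime import time

HOURLY_RATE = 15            # dollars per hour, from the example above
WORK_START = time(9, 0)     # development starts at 9 a.m.
WORK_END = time(18, 0)      # development stops at 6 p.m.


def should_be_running(now):
    """Return True if the cluster should exist at wall-clock time `now`."""
    return WORK_START <= now < WORK_END


def daily_cost(hourly_rate, hours_running):
    """Cost of keeping the cluster up for `hours_running` hours in one day."""
    return hourly_rate * hours_running


working_hours = WORK_END.hour - WORK_START.hour        # 9 hours
always_on = daily_cost(HOURLY_RATE, 24)                # 360 dollars per day
scheduled = daily_cost(HOURLY_RATE, working_hours)     # 135 dollars per day
savings = always_on - scheduled                        # 225 dollars per day

print(f"Always on: ${always_on}, scheduled: ${scheduled}, saved: ${savings} "
      f"({savings / always_on:.1%})")
```

Under these assumptions, deleting the cluster outside working hours cuts the daily compute charge from $360 to $135, a saving of 62.5%.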

5) Utilize local sandboxes whenever possible

  • Just like with custom .NET applications, most developers like to develop their solutions using a local sandbox to ensure that they can test new code without impacting any other solution. With Azure Data Platform projects, this isn't always possible. Some of the products within Azure, like Azure Data Lake Analytics, support local mode execution. Others, like Azure Data Factory, don't have a local mode, meaning all development will need to occur within an Azure Dev/Test Subscription.

Sometimes, you can get creative and find other ways to enable local sandbox development. If you’re building Hive, Pig, or Spark applications with HDInsight, you could consider using a local copy of the Hortonworks Sandbox that matches the HDP version of HDInsight your organization is targeting. While this won't provide a 100% equivalent development environment, it will be close enough to write, test, and debug your application scripts prior to testing in the Azure environment. It may also reduce cost further, because the HDInsight cluster only needs to be running to test scripts already vetted in the local sandbox. If you don't have enough local resources to run the HDP Sandbox on your machine, you can even host it as a single virtual machine (VM) in the Azure cloud. While you'll have to pay for this VM, it will be cheaper to run per hour than an HDInsight cluster, which consists of multiple virtual machines.

Developing Azure Data Platform solutions may require some modification to your current DevOps processes. However, Microsoft has given us the tools to build cloud-based platform solutions while retaining a level of trust that builds and deployments will be safe and successful. Don't be afraid to try creative workarounds, like running a local HDP Sandbox to emulate HDInsight, to meet your organizational goals for development and deployment practices.

Moving to platform-as-a-service-based solutions doesn't mean you have to sacrifice the safety and security of your current DevOps processes. For more information on these and other tips for succeeding with Azure Data Platform projects, contact us today.

About The Author

Josh Fennessy

Josh is a Solution Architect at BlueGranite. Josh is passionate about enabling information workers to become data leaders. His passions in the data space include: Modern Data Warehousing, unstructured analytics, distributed computing, and NoSQL database solutions.
