Microsoft’s HDInsight service lets users scale and manage Hadoop, Spark, R, HBase and Storm through a simple interface. It’s a robust and popular service, but it has been due an upgrade for a while now.
Today Microsoft announced some major additions to the platform, including Spark 2.0, LLAP Hive support, new security features and Zeppelin notebook support.
According to Microsoft, this update gives HDInsight the “highest levels of security for authentication, authorization, auditing and encryption available in the cloud for Hadoop.”
One of the ways the company has achieved this is through encryption. Data in the Azure Data Lake Store or Azure Storage can now be encrypted. Customers can turn this on with no extra configuration, using either service-managed keys or their own keys from the Azure Key Vault.
In addition, HDInsight now supports Apache Ranger. This gives users a robust policy management portal that also lets them analyze audit records. Azure Active Directory and Domain Services have been integrated as well, resulting in fast and secure identity management.
The new security features will be coming with the public preview in October.
Hive and LLAP
Though HDInsight has Hive integration already, it hasn’t had access to the Live Long and Process (LLAP) initiative until now. HDInsight is the first cloud Hadoop service to support LLAP, and Microsoft has seen speed increases of up to 25 times.
It also brings some other improvements, such as better mapjoin vectorization, smarter mapjoins, and a fully vectorized pipeline.
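For readers unfamiliar with the term, a mapjoin avoids a costly shuffle by loading the smaller table into memory as a hash table and streaming the larger table past it. The Python below is only a minimal sketch of that general idea, not Hive's implementation; the function name `map_join` and the sample tables are made up for illustration.

```python
from collections import defaultdict


def map_join(small_rows, large_rows, key):
    """Inner-join two lists of dicts on `key`, hashing the smaller side in memory."""
    # Build phase: index the small table by its join key.
    index = defaultdict(list)
    for row in small_rows:
        index[row[key]].append(row)

    # Probe phase: stream the large table and look up matches in the index.
    joined = []
    for row in large_rows:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data: a small dimension table and a larger fact table.
regions = [
    {"region_id": 1, "region": "EU"},
    {"region_id": 2, "region": "US"},
]
orders = [
    {"order_id": 10, "region_id": 2},
    {"order_id": 11, "region_id": 1},
    {"order_id": 12, "region_id": 3},  # no matching region, so the inner join drops it
]

result = map_join(regions, orders, "region_id")
```

The large table is read only once and never sorted or repartitioned, which is why this style of join pays off when one side is small enough to fit in memory.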
Microsoft’s support for Spark 2.0 is also a big deal, as the release brings some major advancements. The core query engine has been overhauled, more SQL syntax is supported, and both the streaming engine and the machine learning pipelines have been improved.
Furthermore, Microsoft and Hortonworks collaborated to bring over a hundred fixes, as well as a Spark-HBase connector. The result is a tenfold increase in performance, as well as more stability and the ability to perform cache-efficient vectorized computations.
Zeppelin Notebook and Third-party ISV apps
Finally, HDInsight now comes with Zeppelin notebook support, as well as Cask and StreamSets. The notebook integration makes it easier for data scientists to combine code, equations, and visualizations in a narrative way.
Cask, on the other hand, is a data pipeline tool. It provides users with “a self-service, extendable open source framework to visually develop, run, automate and operate data pipelines.” StreamSets builds on this by bringing easier data flow management.
In total, it’s a huge update that brings some great features. With the exception of the security features, all of them are available today. You can find more detail on the Azure blog.