Another week, another new Databricks Runtime.

Runtime 8.2 brings some nice functionality around operational metrics, but the big star of the week is the new Schema Inference & Evolution functionality available through Autoloader.

In this video, Simon walks through simple schema inference, applies schema hints and watches the schema metadata evolve through the various config options available!
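For a flavour of what that looks like in a notebook, here's a minimal sketch of an Autoloader read with inference, hints and a persisted schema location. The paths and hint columns are illustrative placeholders, not taken from the video:

```python
# Minimal Autoloader read with schema inference, hints and evolution.
# Paths and column hints below are illustrative placeholders.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      # Autoloader persists the inferred schema here so it can evolve across runs
      .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/sales")
      # Override inference for specific columns rather than supplying a full schema
      .option("cloudFiles.schemaHints", "amount DECIMAL(18,2), orderDate DATE")
      .load("/mnt/landing/sales/"))
```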

Azure Spot VMs are deeply discounted virtual machines that come with the risk of being evicted if demand for full-price capacity rises in the region.

Luckily, Spark is a resilient distributed system that handles node replacement gracefully, so we’re left with a very cost-effective approach to provisioning lower-priority workloads!

In this video, Simon walks through provisioning a cluster with Spot VM workers, how to reach the lower-level configuration and some of the gotchas to be aware of.
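As a rough illustration of that lower-level configuration, the sketch below creates a cluster through the Clusters API with Spot workers that fall back to on-demand. The workspace URL, token, node type and cluster name are all placeholder assumptions:

```python
import requests

# Hypothetical workspace URL and token; the azure_attributes block is the
# part that enables Spot workers with fall-back to on-demand capacity.
cluster_spec = {
    "cluster_name": "spot-demo",
    "spark_version": "8.2.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
    "azure_attributes": {
        "first_on_demand": 1,                        # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK_AZURE",  # evicted workers replaced with on-demand
        "spot_bid_max_price": -1                     # -1 = pay up to the on-demand price
    },
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)
print(resp.json())
```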

With so much landing in Azure Databricks recently, Advancing Analytics is starting a monthly roundup of the platform updates and any runtime additions.

March 2021 has seen a whole load of new features, from the GA of Runtimes 8.0 and 8.1 to Spot VM bidding, lifted workspace limits and more. Check out this month’s video for the details.

As usual, platform release notes can be found here: https://docs.microsoft.com/en-gb/azure/databricks/release-notes/product/2021/march

Advancing Analytics takes a closer look at the two new runtimes available for Databricks.

We have not just one but two new Databricks Runtimes currently in preview – 7.6 brings several new features focused on making Autoloader more flexible and improving the performance of Optimize and Structured Streaming.

Runtime 8.0 is a much wider change, shifting to Spark 3.1 and introducing new language versions for Python, Scala and R.

This shift brings a large swathe of functionality, performance and feature changes, so take some time to look through the docs.

Simon walks through the high-level release notes, pulling out some interesting features and improvements.

Simon from Advancing Analytics explores the Atlas API that’s exposed under the covers of the new Azure Purview data governance offering.

There are a couple of different libraries available currently, so don’t be surprised if we see a lot of shifts & changes as the preview matures!

In this video, Simon takes a look at how you can get started with the API in a Databricks Notebook to register a custom lineage between two entities.

For more info on the pyapacheatlas library used, see: https://pypi.org/project/pyapacheatlas/
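To give a feel for the pattern, here's a hedged sketch of registering custom lineage with pyapacheatlas. The credentials, entity names, types and qualified names are illustrative assumptions, not from the video:

```python
from pyapacheatlas.auth import ServicePrincipalAuthentication
from pyapacheatlas.core import PurviewClient, AtlasEntity, AtlasProcess

# Service principal credentials are placeholders
auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
client = PurviewClient(account_name="<purview-account>", authentication=auth)

# Two entities; negative guids mark entities to be created on upload
source = AtlasEntity(
    name="raw_sales", typeName="azure_datalake_gen2_path",
    qualified_name="https://<storage>.dfs.core.windows.net/raw/sales", guid=-1)
target = AtlasEntity(
    name="curated_sales", typeName="azure_datalake_gen2_path",
    qualified_name="https://<storage>.dfs.core.windows.net/curated/sales", guid=-2)

# A process entity ties source and target together as lineage
process = AtlasProcess(
    name="transform_sales", typeName="Process",
    qualified_name="notebooks/transform_sales", guid=-3,
    inputs=[source], outputs=[target])

results = client.upload_entities(batch=[source, target, process])
print(results)
```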

Natural language processing (NLP) is a key component of many data science systems that must understand or reason about text. This hands-on tutorial uses the open-source Spark NLP library to explore advanced NLP in Python.

Spark NLP provides state-of-the-art accuracy, speed, and scalability for language understanding by delivering production-grade implementations of some of the most recent research in applied deep learning. It’s the most widely used NLP library in the enterprise today.

You’ll edit and extend a set of executable Python notebooks by implementing these common NLP tasks: named entity recognition, sentiment analysis, spell checking and correction, document classification, and multilingual and multi-domain support. The discussion of each NLP task includes the latest advances in deep learning used to tackle it, including the prebuilt use of BERT embeddings within Spark NLP, using tuned embeddings, and “post-BERT” research results like XLNet, ALBERT, and RoBERTa.

Spark NLP builds on the Apache Spark and TensorFlow ecosystems, and as such it’s the only open-source NLP library that can natively scale to use any Spark cluster, as well as take advantage of the latest processors from Intel and Nvidia. You’ll run the notebooks locally on your laptop, but we’ll explain and show a complete case study and benchmarks on how to scale an NLP pipeline for both training and inference.
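As a taste of how little code a pretrained Spark NLP pipeline needs, here's a small sketch; the example sentence and printed entities are illustrative:

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session with Spark NLP on the classpath
spark = sparknlp.start()

# A pretrained pipeline bundling tokenisation, embeddings and an NER model
pipeline = PretrainedPipeline("recognize_entities_dl", lang="en")

result = pipeline.annotate("John Snow Labs built Spark NLP in Delaware.")
print(result["entities"])  # illustrative output: entity spans found in the text
```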

On the latest episode of Data Brew, Denny Lee talks to Michael Armbrust about Delta Lake.

Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
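A minimal sketch of that batch/streaming unification in PySpark, with placeholder paths:

```python
# Batch write with ACID guarantees (paths are illustrative placeholders)
(spark.range(0, 1000)
 .withColumnRenamed("id", "event_id")
 .write.format("delta").mode("overwrite")
 .save("/mnt/lake/events"))

# The same table can be read as a stream...
stream = spark.readStream.format("delta").load("/mnt/lake/events")

# ...and queried at an earlier version thanks to the transaction log
historical = (spark.read.format("delta")
              .option("versionAsOf", 0)
              .load("/mnt/lake/events"))
```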

For our “Demystifying Delta Lake” session, we will interview Michael Armbrust – committer and PMC member of Apache Spark™ and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Delta Lake.