Data Lakes are notoriously bad at single-record lookups: the kind of query where you're hunting for one specific ID among millions of records.

Wouldn’t it be great if we could just pop an index over the top to speed this type of operation up?

Turns out we can!

In this video Simon runs through a quick introduction to using Bloom Filter Indexes with Databricks Delta.
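To illustrate the idea behind the index, here is a minimal Bloom filter sketched in plain Python (the `BloomFilter` class is written for this example and is not part of any Databricks API): a bit array plus a few hash functions that can answer "definitely not here" or "maybe here" without scanning the data.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter: a bit array plus k hash functions.
    A lookup answers "definitely not present" or "probably present"."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive k bit positions for an item from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # If any bit is unset, the item was definitely never added.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("order-12345")
print(bf.might_contain("order-12345"))  # True
print(bf.might_contain("order-99999"))  # almost certainly False
```

In Databricks SQL the equivalent structure is created per column with `CREATE BLOOMFILTER INDEX`, letting Delta skip files that definitely don't contain the ID you're after.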

Azure Spot VMs are incredibly cheap virtual machines that come with the risk of being evicted if demand for full-price capacity rises in the region.

Luckily, Spark is a resilient distributed system that can easily handle replacing nodes, and so we’re left with a very cost effective approach to provisioning lower-priority workloads!

In this video, Simon walks through the process of provisioning a cluster with Spot VM workers, how to reach the lower-level configuration, and some of the gotchas to be aware of.
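That lower-level configuration lives in the cluster definition. The sketch below shows the relevant fragment of a Databricks Clusters API request body as a Python dict; the field names follow the Azure Databricks Clusters API, while the cluster name and node sizes are illustrative values only.

```python
# Fragment of a Databricks Clusters API request body (illustrative values).
cluster_spec = {
    "cluster_name": "spot-worker-demo",   # hypothetical name
    "spark_version": "8.1.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
    "azure_attributes": {
        # Keep the driver (first node) on-demand so the cluster
        # survives worker evictions.
        "first_on_demand": 1,
        # Use spot workers, falling back to on-demand capacity
        # when spot capacity is unavailable.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        # -1 means "bid up to the on-demand price", so workers are
        # only evicted for capacity, never on price.
        "spot_bid_max_price": -1,
    },
}
print(cluster_spec["azure_attributes"]["availability"])
```

Keeping the driver on-demand is the key gotcha: Spark can replace lost workers, but losing the driver kills the whole cluster.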

With all the things coming out in Azure Databricks recently, Advancing Analytics is starting a monthly roundup of the platform updates and any runtime additions.

March 2021 has seen a whole load of new features, from the GA of Runtimes 8.0 and 8.1 to Spot VM bidding, the lifting of workspace limits, and more. Check out this month’s video for the details.

As usual, platform release notes can be found here:

For a long time, we’ve all had to make do with several workarounds for integrating Databricks into our application lifecycle: syncing single notebooks, or pulling down entire directories via the CLI, a very manual process.

This week, Databricks released “Repos for Git”, which allows for whole repositories to be cloned and managed within the Databricks workspace.

In today’s video, Simon takes a look at how to enable the Repos preview and follows a simple feature-branching workflow: adding a notebook, updating it, and following the pull request back into our primary collaboration branch.

For more info on the repos release, check it out here:
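Repos can also be cloned programmatically. Below is a hedged sketch of calling the Repos REST endpoint with only the standard library; the workspace URL, token, repo URL, and workspace path are all placeholders, and the request itself is left commented out since it needs a real workspace.

```python
# Sketch of cloning a Git repository into the workspace via the
# Databricks Repos API. All concrete values below are placeholders.
import json
import urllib.request

payload = {
    "url": "https://github.com/example-org/example-repo",  # hypothetical repo
    "provider": "gitHub",
    "path": "/Repos/simon@example.com/example-repo",       # hypothetical path
}
req = urllib.request.Request(
    "https://<workspace-url>/api/2.0/repos",
    data=json.dumps(payload).encode(),
    headers={"Authorization": "Bearer <personal-access-token>"},
    method="POST",
)
# urllib.request.urlopen(req)  # would create the repo clone in the workspace
```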

Databricks hosted this tech talk where, in part one, they talk about how they built the engagement activity Delta Lake to support Einstein Analytics for creating powerful reports and dashboards, and Sales Cloud Einstein for training machine learning models.

Salesforce customers are using High Velocity Sales to intelligently convert leads and create new opportunities. To support it, Salesforce built the engagement activity platform to automatically capture and store user engagement activities using Delta Lake, one of the key components supporting Einstein Analytics for creating powerful reports and dashboards.

From the onset of the COVID-19 pandemic, educational institutions had to quickly make the shift to teaching fully online.

In this episode, Kate Carruthers, Chief Data and Insights Officer at the University of New South Wales Sydney, discusses how she’s helping transform the university into a data-driven organization.

Kate and her team are delivering new insights to instructors and students, rapidly moving pilot applications to production, and creating innovative ways to combat new threats that challenge the sanctity of the code of ethics between students and the university.

This talk is brought to you by the Istanbul Spark Meetup.


This live coding session is a gentle introduction to the latest and greatest of Delta Lake.

You will learn what Delta Lake is and what challenges it aims to solve. You will hear about how Delta Lake builds upon the features of the recent Apache Spark 3 and why it can complement your data processing workloads.

During this talk, Jacek will talk about the slogan from the main page of Delta Lake: “Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.”

You will learn about time travel and data versioning using Spark tables in Spark SQL and Spark Structured Streaming.
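The mechanics behind time travel can be sketched without Spark at all: every write creates a new table version in a transaction log, and a read can target any past version. The `ToyDeltaTable` class below is an in-memory toy invented for this example; real Delta Lake stores Parquet files plus a JSON transaction log, and you query old versions with syntax like `SELECT * FROM events VERSION AS OF 1`.

```python
# Toy, in-memory illustration of Delta-style versioning ("time travel").
class ToyDeltaTable:
    def __init__(self):
        self._versions = [[]]  # version 0 is an empty table

    def append(self, rows):
        # Each write produces a new, complete snapshot in the log.
        self._versions.append(self._versions[-1] + list(rows))

    def read(self, version_as_of=None):
        # No version given -> read the latest snapshot.
        if version_as_of is None:
            version_as_of = len(self._versions) - 1
        return list(self._versions[version_as_of])

table = ToyDeltaTable()
table.append([{"id": 1}])            # creates version 1
table.append([{"id": 2}])            # creates version 2
print(table.read())                  # [{'id': 1}, {'id': 2}]
print(table.read(version_as_of=1))   # time travel: [{'id': 1}]
```

The same snapshot idea is what lets Spark Structured Streaming resume from a consistent view of a Delta table.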

The Data + AI Summit 2021 Call for Presentations is closing soon.

Submit your full-length session ideas, lightning talk ideas, and more for the world’s largest gathering of Data + AI practitioners.

The conference is at the end of May, but the CFP is due on Sunday, February 28.

Topics include data engineering, data analytics, AI, data science, machine learning, and more.