
How to list and delete files faster in Azure Databricks


Scenario

Suppose you need to delete a table that is partitioned by several columns. However, the table is huge, with approximately 1000 part files per partition. You can list all the files in each partition and then delete them using an Apache Spark job.

For example, suppose you have a table partitioned by a, b, and c:
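A minimal sketch of how such a table might be produced, assuming a Scala Databricks notebook (where spark is predefined) and an illustrative target path of /mnt/path/table; the sample data is made up:

    // Illustrative only: write a tiny DataFrame partitioned by a, b, and c.
    import spark.implicits._

    Seq((1, 2, 3, 4, 5),
        (2, 3, 4, 5, 6),
        (3, 4, 5, 6, 7),
        (4, 5, 6, 7, 8))
      .toDF("a", "b", "c", "d", "e")
      .write
      .mode("overwrite")
      .partitionBy("a", "b", "c")
      .parquet("/mnt/path/table")   // hypothetical storage path

On storage this produces one directory per partition value, for example /mnt/path/table/a=1/b=2/c=3/, with the part files inside the leaf directories.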

Listing files

You can list all of the part files with a helper function along the lines of the sketch below:
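A minimal sketch of such a helper, assuming a Scala Databricks notebook where sc and spark are predefined. The name listFiles and its arguments are illustrative, and InMemoryFileIndex.bulkListLeafFiles is an internal Spark API whose signature and return type vary between Spark versions (the sketch follows the Spark 2.x form):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.deploy.SparkHadoopUtil
    import org.apache.spark.sql.execution.datasources.InMemoryFileIndex

    // Sketch: return every leaf file under basePath that matches globPattern.
    def listFiles(basePath: String, globPattern: String): Seq[String] = {
      val conf = new Configuration(sc.hadoopConfiguration)
      val fs = FileSystem.get(new URI(basePath), conf)

      // Expand the top-level directories on the driver with globPath...
      val topLevelDirs = SparkHadoopUtil.get.globPath(fs, new Path(basePath, globPattern))

      // ...then distribute the leaf-file listing across the cluster.
      val leafFiles = InMemoryFileIndex.bulkListLeafFiles(
        paths = topLevelDirs,
        hadoopConf = conf,
        filter = null,
        sparkSession = spark)

      // Flatten the (directory, file statuses) pairs into plain path strings.
      leafFiles.flatMap { case (_, statuses) => statuses.map(_.getPath.toString) }
    }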

The function takes a base path and a glob path as arguments, scans the files that match the glob pattern, and then returns all of the matched leaf files as a sequence of strings.

The function also uses the globPath utility function from the SparkHadoopUtil package. globPath lists all of the paths in a directory with the specified prefix, but does not go on to list the leaf children (files). That list of paths is then passed to the InMemoryFileIndex.bulkListLeafFiles method, an internal Spark API for distributed file listing.

Neither of these listing utility functions works well on its own. By combining them, you can get the list of top-level directories you want to list with the globPath function, which runs on the driver, and then distribute the listing of all the child leaves of those top-level directories across the cluster with bulkListLeafFiles.

This can speed up the listing by roughly 20 to 50x, in line with Amdahl's law, because you can tailor the glob path to the physical layout of the files and control the parallelism of the distributed listing.
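A usage sketch of the hypothetical listFiles helper above, with illustrative paths and glob patterns:

    // List every leaf file under the table, or narrow the glob to match the
    // physical layout (for example, a single top-level partition).
    val allFiles     = listFiles("/mnt/path/table", "*")
    val onePartition = listFiles("/mnt/path/table", "a=1/*")

    allFiles.take(10).foreach(println)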

Deleting files

When you delete files or partitions from an unmanaged table, you can use the Azure Databricks utility function dbutils.fs.rm. This function takes advantage of the native cloud storage file system API, which is optimized for all file operations. However, you cannot delete a gigantic table directly with a single call.
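For example, a single recursive delete of an entire (illustrative) table path looks like this; on a gigantic table this one driver-side call is exactly what does not scale:

    // Recursive delete of the whole table path in one call.
    // Works for modestly sized tables only; the path is illustrative.
    dbutils.fs.rm("/mnt/path/table", true)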

You can list files efficiently using the script above. For smaller tables, the collected paths of the files to delete fit into the driver memory, so you can use a Spark job to distribute the file deletion task.
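A minimal sketch of such a job, reusing the illustrative paths from earlier. The helper name delete is hypothetical, and the pattern assumes that dbutils.fs file operations can be invoked from executor code on Databricks:

    import spark.implicits._

    // Sketch: list the children of a path on the driver, then let executor
    // tasks delete them in parallel.
    def delete(p: String): Unit = {
      dbutils.fs.ls(p).map(_.path).toDF("path").foreach { file =>
        dbutils.fs.rm(file(0).toString, true)   // recursive delete of each child
        println(s"deleted file: $file")         // goes to the executor logs
      }
    }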

For gigantic tables, even the string representations of the file paths for a single top-level partition may not fit into driver memory. The easiest way to solve this problem is to collect the paths of the inner partitions recursively, list the files under each, and delete them in parallel.
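A sketch of that recursive approach, building on the hypothetical delete helper above; walkDelete and its level argument are illustrative, not a built-in API:

    import scala.util.{Failure, Success, Try}

    // Sketch: walk the partition directories down to `level`, then delete in
    // parallel from there. Level 0 starts deleting immediately under root.
    def walkDelete(root: String)(level: Int): Unit = {
      dbutils.fs.ls(root).map(_.path).foreach { p =>
        println(s"Deleting: $p, on level: $level")
        val deleting = Try {
          if (level == 0) delete(p)
          else if (p.endsWith("/")) walkDelete(p)(level - 1)   // recurse into child partitions
          else delete(p)                                       // plain file at this level
        }
        deleting match {
          case Success(_) =>
            println(s"Successfully deleted $p")
            dbutils.fs.rm(p, true)   // remove the now-empty directory itself
          case Failure(e) => println(e.getMessage)
        }
      }
    }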

The code deletes inner partitions while making sure that each partition being deleted is small enough. It does this by walking the partitions recursively, level by level, and only starting to delete once it reaches the level you set. For example, to start deleting at the top-level partitions, call the helper with a level of 0 (walkDelete(root)(0) in the sketch above). Spark then deletes all of the files under the first top-level partition, then the next, following the pattern until the partitions are exhausted.

The Spark job distributes the deletion task using the delete helper sketched above, listing the files with dbutils.fs.ls on the assumption that the number of child partitions at this level is small. You can also make it more efficient by replacing that listing with the listFiles helper sketched earlier, with only minor changes.

Summary

These two approaches highlight methods for listing and deleting gigantic tables. They use some Spark helper functions and functions that are specific to the Azure Databricks environment. Even if you cannot use them directly, you can create your own utility functions to solve the problem analogously.