python - walk through a huge set of files but in a more efficient manner

Use the in operator to test whether any of your keywords appears in a directory name.

import os

ignoreList = []
for _, dirnames, _ in os.walk(START_FOLDER):
    for name in dirnames:
        if any(k in name.lower() for k in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST):
            ignoreList.append(name)
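If the matched folders (and everything beneath them) do not need to be visited at all, a further speed-up is to prune them from dirnames in place so os.walk never descends into them. This is only a sketch under that assumption; START_FOLDER and FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST are placeholders carried over from the snippet above.

import os

ignoreList = []
for dirpath, dirnames, filenames in os.walk(START_FOLDER):
    matched = [name for name in dirnames
               if any(k in name.lower() for k in FOLDER_WITH_KEYWORDS_DELETION_EXCEPTION_LIST)]
    ignoreList.extend(os.path.join(dirpath, name) for name in matched)
    # Editing dirnames in place tells os.walk (top-down by default) not to
    # recurse into the pruned folders, which saves work on huge trees.
    dirnames[:] = [name for name in dirnames if name not in matched]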

Suggestion : 2

Python’s built-in os.walk() is significantly slower than it needs to be, because, in addition to calling os.listdir() on each directory, it executes the stat() system call or GetFileAttributes() on each file to determine whether the entry is a directory or not.

Like listdir, scandir calls the operating system’s directory iteration system calls to get the names of the files in the given path, but it’s different from listdir in two ways: it returns a generator rather than a list, and it yields lightweight DirEntry objects rather than bare filename strings.

As part of this proposal, os.walk() will also be modified to use scandir() rather than listdir() and os.path.isdir(). This will increase the speed of os.walk() very significantly (as mentioned above, by 2-20 times, depending on the system).

As a case in point that shows the non-symlink-following version is error prone, this PEP’s author had a bug caused by getting this exact test wrong in his initial implementation of scandir.walk() in scandir.py (see Issue #4 here).

scandir(path='.') -> generator of DirEntry objects

def subdirs(path):
    """Yield directory names not starting with '.' under given path."""
    for entry in os.scandir(path):
        if not entry.name.startswith('.') and entry.is_dir():
            yield entry.name

def get_tree_size(path):
    """Return total size of files in given path and subdirs."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            total += get_tree_size(entry.path)
        else:
            total += entry.stat(follow_symlinks=False).st_size
    return total
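For illustration, the two helpers above could be exercised like this (a minimal usage sketch, run against the current directory):

for name in subdirs('.'):
    print(name)                                # immediate subdirectories, dotfiles excluded
print('total bytes:', get_tree_size('.'))      # recursive size of the whole tree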
def get_tree_size(path):
    """Return total size of files in path and subdirs. If
    is_dir() or stat() fails, print an error message to stderr
    and assume zero size (for example, file has been deleted).
    """
    total = 0
    for entry in os.scandir(path):
        try:
            is_dir = entry.is_dir(follow_symlinks=False)
        except OSError as error:
            print('Error calling is_dir():', error, file=sys.stderr)
            continue
        if is_dir:
            total += get_tree_size(entry.path)
        else:
            try:
                total += entry.stat(follow_symlinks=False).st_size
            except OSError as error:
                print('Error calling stat():', error, file=sys.stderr)
    return total
it = os.scandir(path)
while True:
    try:
        entry = next(it)
    except OSError as error:
        handle_error(path, error)
    except StopIteration:
        break
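Putting the pieces together, one way to drive that manual-iteration pattern is a small recursive generator in the spirit of os.walk. This is only a sketch, not the stdlib implementation; handle_error is an assumed caller-supplied callback.

import os

def scandir_walk(top, handle_error=lambda path, err: None):
    """Sketch of an os.walk-like generator built on os.scandir()."""
    dirs, files = [], []
    try:
        it = os.scandir(top)
    except OSError as error:
        handle_error(top, error)
        return
    while True:
        try:
            entry = next(it)
        except StopIteration:
            break
        except OSError as error:
            handle_error(top, error)
            break
        if entry.is_dir(follow_symlinks=False):
            dirs.append(entry.name)
        else:
            files.append(entry.name)
    yield top, dirs, files
    for name in dirs:
        # Recurse into subdirectories after yielding the current level (top-down).
        yield from scandir_walk(os.path.join(top, name), handle_error)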

Suggestion : 3

The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this (a minimal PySpark sketch follows the snippets below).

You can see some example Spark programs on the Spark website. In addition, Spark includes several samples in the examples directory (Scala, Java, Python, R). You can run Java and Scala examples by passing the class name to Spark’s bin/run-example script, for instance.

groupId = org.apache.spark
artifactId = spark-core_2.12
version = 3.3.0

groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;

install_requires = [
    'pyspark=={site.SPARK_VERSION}'
]

from pyspark import SparkContext, SparkConf
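To tie the PySpark imports above to the point about passing functions, here is a minimal, hypothetical sketch: it builds a local SparkContext and passes an ordinary top-level function to a transformation. The app name and the local[2] master are placeholder values, not taken from the original guide.

from pyspark import SparkConf, SparkContext

def add_one(x):
    # A plain top-level function, one of the recommended ways to pass behavior to Spark.
    return x + 1

conf = SparkConf().setAppName("walk-example").setMaster("local[2]")
sc = SparkContext(conf=conf)
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(add_one).collect())   # expected: [2, 3, 4, 5]
sc.stop()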