Use a list comprehension:
from pyspark.sql.types import StructType

cols_filtered = [c for c in df.schema.names
                 if not isinstance(df.schema[c].dataType, StructType)]
Or, using df.dtypes:

# Thank you @pault for the suggestion!
# Note: df.dtypes reports struct columns as 'struct<...>',
# so match on the prefix rather than comparing against the exact string.
cols_filtered = [c for c, t in df.dtypes
                 if not t.startswith('struct')]
Now you can pass the result to df.select:

df2 = df.select(*cols_filtered)
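For a quick end-to-end check, here is a minimal sketch; the sample rows and the nested "address" column are made-up assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: "address" is a struct column we want to filter out.
df = spark.createDataFrame(
    [("Alice", ("NY", "10001")), ("Bob", ("LA", "90001"))],
    "name string, address struct<city:string, zip:string>")

cols_filtered = [c for c in df.schema.names
                 if not isinstance(df.schema[c].dataType, StructType)]

df.select(*cols_filtered).printSchema()
# root
#  |-- name: string (nullable = true)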
PySpark DataFrame provides a drop() method to drop a single column/field or multiple columns from a DataFrame. In this article, I will explain ways to drop columns using PySpark (Spark with Python) examples. Below is a complete example DataFrame to work against.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

simpleData = [("James", "", "Smith", "36636", "NewYork", 3100),
              ("Michael", "Rose", "", "40288", "California", 4300),
              ("Robert", "", "Williams", "42114", "Florida", 1400),
              ("Maria", "Anne", "Jones", "39192", "Florida", 5500),
              ("Jen", "Mary", "Brown", "34561", "NewYork", 3000)]

columns = ["firstname", "middlename", "lastname", "id", "location", "salary"]
df = spark.createDataFrame(data=simpleData, schema=columns)
df.printSchema()
This yields the output below.
root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: long (nullable = true)
PySpark drop() takes self and *cols as arguments. The sections below explain it with examples.
drop(self, *cols)
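For reference, here is a sketch of three equivalent single-column forms using the standard DataFrame API; each drops "firstname" from the example DataFrame above:

from pyspark.sql.functions import col

df.drop("firstname").printSchema()       # by column name string
df.drop(col("firstname")).printSchema()  # by Column expression
df.drop(df.firstname).printSchema()      # by DataFrame attribute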
The above three examples each drop the column “firstname” from the DataFrame; use whichever form suits your need. All of them yield:
root
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- id: string (nullable = true)
 |-- location: string (nullable = true)
 |-- salary: long (nullable = true)
You can also pass several column names, or unpack a sequence of names, as arguments to drop(). This removes more than one column at a time from a DataFrame; both forms below yield the same output.
df.drop("firstname", "middlename", "lastname")\
.printSchema()
cols = ("firstname", "middlename", "lastname")
df.drop( * cols)\
.printSchema()
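One caveat worth noting: per the PySpark docs, drop() is a no-op for names missing from the schema, so a misspelled column is silently ignored rather than raising an error:

# Hypothetical misspelling: "first_name" is not in the schema, so only
# "middlename" is actually dropped and no error is raised.
df.drop("first_name", "middlename").printSchema()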
Next, a few ways to inspect a DataFrame's structure: the schema attribute returns the schema as a StructType object, printSchema() prints it in tree format, columns returns the column names, and dtypes returns the column names with their data types. Indexing the schema with a field name (or a StructField object) returns that column's data type and nullable information.
df = spark.read.csv("file:///path_to_files/csv_file_with_duplicates.csv", header=True)
df.show()

+-----+---------+-------+
|db_id|  db_name|db_type|
+-----+---------+-------+
|   12| Teradata|  RDBMS|
|   14|Snowflake|CloudDB|
|   15|  Vertica|  RDBMS|
|   12| Teradata|  RDBMS|
|   22|    Mysql|  RDBMS|
+-----+---------+-------+
df.printSchema()
Output:
root
 |-- db_id: string (nullable = true)
 |-- db_name: string (nullable = true)
 |-- db_type: string (nullable = true)
df.columns
Output: ['db_id', 'db_name', 'db_type']
column_name_list = df.columns        # getting column list
column_name_list.remove('db_type')   # removing unwanted column from list
df.select(column_name_list).show()   # displaying data for all the remaining columns

+-----+---------+
|db_id|  db_name|
+-----+---------+
|   12| Teradata|
|   14|Snowflake|
|   15|  Vertica|
|   12| Teradata|
|   22|    Mysql|
+-----+---------+
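Equivalently, the drop() method covered earlier collapses the remove-from-list-then-select dance into a single step:

df.drop('db_type').show()   # same output as the select() above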
df.dtypes
Output: [('db_id', 'string'), ('db_name', 'string'), ('db_type', 'string')]
df.schema
Output:
StructType(List(StructField(db_id,StringType,true), StructField(db_name,StringType,true), StructField(db_type,StringType,true)))
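As mentioned above, the schema can also be indexed by field name; a small sketch of what that returns:

field = df.schema['db_id']   # returns a StructField
field.dataType               # StringType
field.nullable               # True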
schema – a StructType or list of column names; default None. When schema is None, createDataFrame will try to infer the schema (column names and types) from data, which should be an RDD of Row, or namedtuple, or dict.
>>> l = [('Alice', 1)]
>>> sqlContext.createDataFrame(l).collect()
[Row(_1=u'Alice', _2=1)]
>>> sqlContext.createDataFrame(l, ['name', 'age']).collect()
[Row(name=u'Alice', age=1)]

>>> d = [{'name': 'Alice', 'age': 1}]
>>> sqlContext.createDataFrame(d).collect()
[Row(age=1, name=u'Alice')]

>>> rdd = sc.parallelize(l)
>>> sqlContext.createDataFrame(rdd).collect()
[Row(_1=u'Alice', _2=1)]
>>> df = sqlContext.createDataFrame(rdd, ['name', 'age'])
>>> df.collect()
[Row(name=u'Alice', age=1)]

>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> person = rdd.map(lambda r: Person(*r))
>>> df2 = sqlContext.createDataFrame(person)
>>> df2.collect()
[Row(name=u'Alice', age=1)]

>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> df3 = sqlContext.createDataFrame(rdd, schema)
>>> df3.collect()
[Row(name=u'Alice', age=1)]

>>> sqlContext.createDataFrame(df.toPandas()).collect()
[Row(name=u'Alice', age=1)]
>>> sqlContext.createDataFrame(pandas.DataFrame([[1, 2]])).collect()
[Row(0=1, 1=2)]
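These doctests use the legacy SQLContext entry point; the same createDataFrame calls work unchanged on a modern SparkSession, for example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([('Alice', 1)], ['name', 'age']).collect()
# [Row(name='Alice', age=1)]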