
How to Select Columns by Type in PySpark

When working with PySpark DataFrames that contain many columns of mixed types, you often need to select only columns of a specific data type - for example, extracting all numeric columns for statistical analysis, all string columns for text processing, or all timestamp columns for time-series operations.

In this guide, you will learn two methods to select columns by their data type in PySpark: using dtypes with string matching and using schema.fields with type checking. Both approaches are demonstrated with clear examples and reusable patterns.

Setting Up the Example DataFrame

Let's create a DataFrame with columns of different data types to demonstrate the selection techniques:

from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField,
    IntegerType, StringType, FloatType, DoubleType,
)

spark = SparkSession.builder.appName("select_by_type").getOrCreate()

data = [
    (1, "Sravan", 9.8, 4500.00),
    (2, "Ojaswi", 9.2, 6789.00),
    (3, "Bobby", 8.9, 988.00),
]

schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("GPA", FloatType(), True),
    StructField("Fee", DoubleType(), True),
])

df = spark.createDataFrame(data, schema)
df.show()
df.printSchema()

Output:

+---+------+---+------+
| ID| Name|GPA| Fee|
+---+------+---+------+
| 1|Sravan|9.8|4500.0|
| 2|Ojaswi|9.2|6789.0|
| 3| Bobby|8.9| 988.0|
+---+------+---+------+

root
|-- ID: integer (nullable = true)
|-- Name: string (nullable = true)
|-- GPA: float (nullable = true)
|-- Fee: double (nullable = true)

Method 1: Using dtypes with String Matching

The dtypes property returns a list of (column_name, type_string) tuples. You can filter this list using startswith() to find columns matching a specific type string.

Type String Keywords

| PySpark Type | Type String Keyword |
| --- | --- |
| IntegerType | "int" |
| LongType | "bigint" |
| StringType | "string" |
| FloatType | "float" |
| DoubleType | "double" |
| BooleanType | "boolean" |
| DateType | "date" |
| TimestampType | "timestamp" |

Selecting Columns by Type

# View all column types
print("Column types:", df.dtypes)

# Select only integer columns
# Select only integer columns
int_cols = df.select(
    [col_name for col_name, col_type in df.dtypes if col_type.startswith("int")]
)
int_cols.show()

# Select only string columns
str_cols = df.select(
    [col_name for col_name, col_type in df.dtypes if col_type.startswith("string")]
)
str_cols.show()

# Select only float columns
float_cols = df.select(
    [col_name for col_name, col_type in df.dtypes if col_type.startswith("float")]
)
float_cols.show()

# Select only double columns
double_cols = df.select(
    [col_name for col_name, col_type in df.dtypes if col_type.startswith("double")]
)
double_cols.show()

Output:

Column types: [('ID', 'int'), ('Name', 'string'), ('GPA', 'float'), ('Fee', 'double')]

+---+
| ID|
+---+
| 1|
| 2|
| 3|
+---+

+------+
| Name|
+------+
|Sravan|
|Ojaswi|
| Bobby|
+------+

+---+
|GPA|
+---+
|9.8|
|9.2|
|8.9|
+---+

+------+
| Fee|
+------+
|4500.0|
|6789.0|
| 988.0|
+------+

Selecting All Numeric Columns

To select all numeric columns (integer, float, double, long), you can check against multiple type strings:

numeric_types = ("int", "bigint", "float", "double")

numeric_cols = df.select(
    [col_name for col_name, col_type in df.dtypes if col_type.startswith(numeric_types)]
)
numeric_cols.show()

Output:

+---+---+------+
| ID|GPA| Fee|
+---+---+------+
| 1|9.8|4500.0|
| 2|9.2|6789.0|
| 3|8.9| 988.0|
+---+---+------+

Tip: Using startswith() with a tuple of strings is a clean way to match multiple types at once. This is especially useful when you need "all numeric columns" without listing every specific numeric type individually.

Method 2: Using schema.fields with isinstance()

The schema.fields property returns a list of StructField objects, each containing the column name and its dataType object. You can use Python's isinstance() to check the type more precisely.

Selecting Columns by Type

from pyspark.sql.types import IntegerType, StringType, FloatType, DoubleType

# Select integer columns
# Select integer columns
int_cols = df.select(
    [field.name for field in df.schema.fields if isinstance(field.dataType, IntegerType)]
)
int_cols.show()

# Select string columns
str_cols = df.select(
    [field.name for field in df.schema.fields if isinstance(field.dataType, StringType)]
)
str_cols.show()

# Select float columns
float_cols = df.select(
    [field.name for field in df.schema.fields if isinstance(field.dataType, FloatType)]
)
float_cols.show()

# Select double columns
double_cols = df.select(
    [field.name for field in df.schema.fields if isinstance(field.dataType, DoubleType)]
)
double_cols.show()

Output:

+---+
| ID|
+---+
| 1|
| 2|
| 3|
+---+

+------+
| Name|
+------+
|Sravan|
|Ojaswi|
| Bobby|
+------+

+---+
|GPA|
+---+
|9.8|
|9.2|
|8.9|
+---+

+------+
| Fee|
+------+
|4500.0|
|6789.0|
| 988.0|
+------+

Selecting All Numeric Columns with isinstance()

from pyspark.sql.types import IntegerType, LongType, FloatType, DoubleType

numeric_types = (IntegerType, LongType, FloatType, DoubleType)

numeric_cols = df.select(
    [field.name for field in df.schema.fields if isinstance(field.dataType, numeric_types)]
)
numeric_cols.show()

Output:

+---+---+------+
| ID|GPA| Fee|
+---+---+------+
| 1|9.8|4500.0|
| 2|9.2|6789.0|
| 3|8.9| 988.0|
+---+---+------+

Creating a Reusable Helper Function

For frequent use, encapsulate the column selection logic in a reusable function:

from pyspark.sql import DataFrame

def select_columns_by_type(df: DataFrame, *data_types) -> DataFrame:
    """
    Select columns from a DataFrame that match any of the given PySpark data types.

    Args:
        df: Input PySpark DataFrame.
        *data_types: One or more PySpark type classes (e.g., IntegerType, StringType).

    Returns:
        DataFrame containing only the matching columns.
    """
    # *data_types is already a tuple, so it can be passed straight to isinstance()
    selected = [
        field.name for field in df.schema.fields
        if isinstance(field.dataType, data_types)
    ]
    if not selected:
        raise ValueError(f"No columns found matching types: {data_types}")
    return df.select(selected)

Usage:

from pyspark.sql.types import IntegerType, FloatType, DoubleType, StringType

# Select all numeric columns
numeric_df = select_columns_by_type(df, IntegerType, FloatType, DoubleType)
numeric_df.show()

# Select only string columns
string_df = select_columns_by_type(df, StringType)
string_df.show()

Output:

+---+---+------+
| ID|GPA| Fee|
+---+---+------+
| 1|9.8|4500.0|
| 2|9.2|6789.0|
| 3|8.9| 988.0|
+---+---+------+

+------+
| Name|
+------+
|Sravan|
|Ojaswi|
| Bobby|
+------+

Excluding Columns by Type

Sometimes you need the opposite - select all columns except those of a certain type. Simply negate the condition:

# Select all non-string columns (using dtypes method)
non_string_cols = df.select(
    [col_name for col_name, col_type in df.dtypes if not col_type.startswith("string")]
)
non_string_cols.show()

Output:

+---+---+------+
| ID|GPA| Fee|
+---+---+------+
| 1|9.8|4500.0|
| 2|9.2|6789.0|
| 3|8.9| 988.0|
+---+---+------+

Common Mistake: Empty Selection

If no columns match the specified type, select() receives an empty list. Depending on your Spark version, this either raises an error or silently returns a DataFrame with no columns - neither of which is what you want.

Problem:

from pyspark.sql.types import BooleanType

# No boolean columns exist in this DataFrame
bool_cols = [
    field.name for field in df.schema.fields
    if isinstance(field.dataType, BooleanType)
]

# bool_cols is empty, so this yields a DataFrame with no columns
# (or an error, depending on the Spark version)
df.select(bool_cols).show()

Solution - always check before selecting:

from pyspark.sql.types import BooleanType

bool_cols = [
    field.name for field in df.schema.fields
    if isinstance(field.dataType, BooleanType)
]

if bool_cols:
    df.select(bool_cols).show()
else:
    print("No boolean columns found in the DataFrame.")

Output:

No boolean columns found in the DataFrame.

Comparison of Methods

| Aspect | dtypes + String Matching | schema.fields + isinstance() |
| --- | --- | --- |
| Approach | Compares the type as a string | Compares the type as a class |
| Precision | Prefix matching can over-match (e.g., startswith("int") also matches interval type strings) | Exact type matching |
| Readability | Simpler syntax | More Pythonic, type-safe |
| Complex types | Harder to match (e.g., ArrayType, MapType) | Handles complex types naturally |
| Best for | Quick, simple type filtering | Production code, precise type checking |
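The precision difference is easy to demonstrate with plain strings, no SparkSession required. The type strings below mimic what df.dtypes might report; the "interval day to second" entry is the string form that recent Spark versions use for day-time interval columns (an assumption worth verifying against your Spark version):

```python
# Hypothetical (name, type_string) pairs mimicking df.dtypes output
dtypes = [("ID", "int"), ("Views", "bigint"), ("Elapsed", "interval day to second")]

# Prefix matching: "interval ..." also starts with "int",
# so it is picked up alongside the true integer column
prefix_match = [name for name, t in dtypes if t.startswith("int")]
print(prefix_match)  # ['ID', 'Elapsed']

# Exact comparison only matches true "int" columns
exact_match = [name for name, t in dtypes if t == "int"]
print(exact_match)  # ['ID']
```

When precision matters, either compare type strings with == or switch to the isinstance() approach.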

Summary

Selecting columns by data type in PySpark is essential for applying type-specific transformations or analyses. Key takeaways:

  • Use dtypes with startswith() for quick, string-based type filtering - ideal for simple cases and interactive exploration.
  • Use schema.fields with isinstance() for precise, type-safe filtering - better for production code and complex type hierarchies.
  • Pass a tuple of types to startswith() or isinstance() to match multiple types at once (e.g., all numeric types).
  • Always check for empty results before calling select() to avoid runtime errors.
  • Consider wrapping the logic in a reusable helper function for consistent use across your codebase.