Spark

Frequently used PySpark code snippets.

Create Spark session:

#!pip install pyspark
#!pip install findspark

#import os
#os.environ["SPARK_HOME"] = "/opt/spark-3.0.0" # might be required
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Learning_Spark") \
    .getOrCreate()
sc = spark.sparkContext # Sometimes required, e.g. to create RDDs

Create dataframe from json:

import json
# json_array is assumed to be a list of Python dicts
jsonStrings = [json.dumps(x, ensure_ascii=False) for x in json_array]
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()

Concatenate dataframes
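
A minimal sketch, assuming df1 and df2 share the same schema:

df_concat = df1.union(df2)         # matches columns by position
df_concat = df1.unionByName(df2)   # matches columns by name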

Print schema
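
For any dataframe df:

df.printSchema()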

Select columns
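
Column names here are placeholders:

from pyspark.sql import functions as F
df.select("col_a", "col_b").show()
df.select(F.col("col_a").alias("renamed")).show()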

Create new columns
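
A sketch with illustrative column names:

from pyspark.sql import functions as F
df = df.withColumn("total", F.col("price") * F.col("quantity"))
df = df.withColumn("constant", F.lit(1))  # same literal value for every row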

Print dataframe
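
For any dataframe df:

df.show()                    # first 20 rows, long values truncated
df.show(5, truncate=False)   # first 5 rows, full values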

Print descriptive statistics for dataframe
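
describe covers count/mean/stddev/min/max; summary also adds percentiles:

df.describe().show()
df.summary().show()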

Apply SQL to dataframe
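
Register the dataframe as a temp view first; the view and column names are placeholders:

df.createOrReplaceTempView("my_table")
result = spark.sql("SELECT col_a, COUNT(*) AS n FROM my_table GROUP BY col_a")
result.show()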

Return a new row for each element of an array column
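
A sketch using explode, assuming "items" is an array column:

from pyspark.sql import functions as F
df = df.withColumn("item", F.explode(F.col("items")))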

Return a single concatenated string per group in a groupby
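
A sketch combining collect_list with concat_ws; "key" and "value" are placeholders:

from pyspark.sql import functions as F
df.groupBy("key") \
  .agg(F.concat_ws(", ", F.collect_list("value")).alias("values")) \
  .show()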

Filter rows

The result is a new Spark dataframe.
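
A sketch with placeholder column names:

from pyspark.sql import functions as F
filtered = df.filter(F.col("age") > 30)
filtered = df.filter((F.col("age") > 30) & (F.col("country") == "US"))
filtered = df.where("age > 30")   # SQL-style string also works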

Match rows with regex
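
rlike keeps rows whose column matches a Java regex; names are illustrative:

from pyspark.sql import functions as F
matches = df.filter(F.col("name").rlike("^A"))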

Create empty columns with type
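
A sketch adding a typed all-null column:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType
df = df.withColumn("empty_col", F.lit(None).cast(StringType()))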

Detect null values in a column
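
"col_a" is a placeholder:

from pyspark.sql import functions as F
df.filter(F.col("col_a").isNull()).show()
# count the nulls instead of listing them
df.select(F.count(F.when(F.col("col_a").isNull(), 1)).alias("n_nulls")).show()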

Fill null values
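
A single value for all compatible columns:

df = df.fillna(0)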

Read/write parquet files
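
The path is a placeholder:

df.write.mode("overwrite").parquet("/tmp/my_data.parquet")
df = spark.read.parquet("/tmp/my_data.parquet")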

Stratified sample
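
sampleBy takes a sampling fraction per value of the key column; names and fractions here are illustrative:

fractions = {"A": 0.1, "B": 0.5}
sample = df.sampleBy("label", fractions=fractions, seed=42)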

Transform to dict
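
collect() brings all rows to the driver, so use it only on small results:

rows = df.collect()
dicts = [row.asDict() for row in rows]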

Filter columns with/without nulls
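
A sketch that counts nulls per column, then keeps only the complete columns:

from pyspark.sql import functions as F
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), 1)).alias(c) for c in df.columns]
).first().asDict()
cols_without_nulls = [c for c, n in null_counts.items() if n == 0]
cols_with_nulls = [c for c, n in null_counts.items() if n > 0]
df_clean = df.select(cols_without_nulls)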

Group by
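
"key" and "value" are placeholders:

from pyspark.sql import functions as F
df.groupBy("key").agg(
    F.count("*").alias("n"),
    F.avg("value").alias("avg_value"),
).show()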

Group by column A keep the rows that are max according to column B
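
A window-function sketch; "col_a" and "col_b" are placeholders:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("col_a").orderBy(F.col("col_b").desc())
df_max = (df.withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .drop("rn"))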

Group by (count nulls)
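
Counts the nulls of "value" per group; names are illustrative:

from pyspark.sql import functions as F
df.groupBy("key").agg(
    F.count(F.when(F.col("value").isNull(), 1)).alias("n_nulls")
).show()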

Order by
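
Column names are placeholders:

from pyspark.sql import functions as F
df.orderBy("col_a").show()                               # ascending by default
df.orderBy(F.col("col_a").desc(), F.col("col_b")).show()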

Alias

This helps complement the withColumn functionality when the new column is the result of a Spark transformation.
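
A sketch naming transformation results directly; column names are illustrative:

from pyspark.sql import functions as F
df.select(F.col("value").cast("double").alias("value_double"))
df.groupBy("key").agg(F.sum("value").alias("total"))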

Null processing

Remove all rows with at least a null value
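
dropna with default arguments drops a row if any value is null:

df_clean = df.dropna()   # same as df.dropna(how="any")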

Keep only rows with at least X non-null values
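
thresh keeps rows holding at least that many non-null values:

df_clean = df.dropna(thresh=3)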

Check only given columns
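
Column names are placeholders:

df_clean = df.dropna(subset=["col_a", "col_b"])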

Fill na values
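
A dict gives a per-column fill value; names and values are illustrative:

df_filled = df.fillna({"col_a": 0, "col_b": "missing"})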

Example classification
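
A minimal end-to-end sketch with pyspark.ml: df is assumed to have numeric feature columns f1..f3 and a binary "label" column; all names are placeholders:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print(auc)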

NLP tools

Tokenization
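
"text" is a placeholder column of strings:

from pyspark.ml.feature import Tokenizer, RegexTokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
df_tokens = tokenizer.transform(df)
# RegexTokenizer splits on a custom pattern instead of whitespace
regex_tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W")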

Stop words
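
Operates on the token arrays produced above:

from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
df_filtered = remover.transform(df_tokens)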

TFIDF
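
HashingTF computes term frequencies and IDF rescales them; column names follow the snippets above:

from pyspark.ml.feature import HashingTF, IDF

hashing_tf = HashingTF(inputCol="filtered", outputCol="tf")
df_tf = hashing_tf.transform(df_filtered)
idf = IDF(inputCol="tf", outputCol="tfidf")
df_tfidf = idf.fit(df_tf).transform(df_tf)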
