Spark
Frequently used PySpark code snippets.
Create Spark session:
#!pip install pyspark
#!pip install findspark
#import os
#os.environ["SPARK_HOME"] = "/opt/spark-3.0.0" # might be required
import findspark  # required for the init() call below
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[*]") \
.appName("Learning_Spark") \
.getOrCreate()
sc = spark.sparkContext  # sometimes required
Create dataframe from json:
import json
# json_array: a list of Python dicts to be serialized as JSON strings
jsonStrings = [json.dumps(x, ensure_ascii=False) for x in json_array]
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
Concatenate dataframe
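Row-wise concatenation is a union. A minimal sketch, assuming DataFrames df1 and df2 with the same columns:
df_all = df1.unionByName(df2)  # matches columns by name (Spark >= 2.3)
# df_all = df1.union(df2)      # positional union; columns must be in the same order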
Print schema
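For example, on the DataFrame created above:
otherPeople.printSchema()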
Select columns
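A sketch; df and the column names are assumptions:
from pyspark.sql import functions as F
df.select("name", "age")                               # by name
df.select(F.col("age").cast("int").alias("age_int"))   # with an expression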
Create new columns
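Assuming a DataFrame df with a numeric age column:
from pyspark.sql import functions as F
df = df.withColumn("age_plus_one", F.col("age") + 1)  # derived column
df = df.withColumn("source", F.lit("json"))           # constant column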
Print dataframe
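Assuming a DataFrame df:
df.show()                   # first 20 rows, long cells truncated
df.show(5, truncate=False)  # 5 rows, full cell contents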
Print descriptive statistics for dataframe
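Assuming a DataFrame df:
df.describe().show()                    # count, mean, stddev, min, max
df.summary("25%", "50%", "75%").show()  # selected percentiles (Spark >= 2.3)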
Apply SQL to dataframe
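Register the DataFrame as a temporary view, then query it; df and the column names are assumptions:
df.createOrReplaceTempView("people")
spark.sql("SELECT name, COUNT(*) AS n FROM people GROUP BY name").show()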
Return a new row per each element in an array in column
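A sketch, assuming df has an array column tags:
from pyspark.sql import functions as F
df.select("id", F.explode("tags").alias("tag"))  # one output row per array element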
Return a single string per row in groupby
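A sketch, assuming df has columns id and tag:
from pyspark.sql import functions as F
df.groupBy("id").agg(F.concat_ws(", ", F.collect_list("tag")).alias("tags"))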
Filter rows
The result will be a Spark DataFrame.
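A sketch (df and age are assumed names):
from pyspark.sql import functions as F
adults = df.filter(F.col("age") > 30)  # equivalently: df.where("age > 30")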
Match rows with regex
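A sketch using rlike; df, name, and the pattern are assumptions:
from pyspark.sql import functions as F
df.filter(F.col("name").rlike(r"^A.*a$"))  # keep rows whose name matches the regex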
Create empty columns with type
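A null column with an explicit type (df and the names are assumptions):
from pyspark.sql import functions as F
df = df.withColumn("notes", F.lit(None).cast("string"))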
Detect null values in a column
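A sketch (df and age are assumed names):
from pyspark.sql import functions as F
df.filter(F.col("age").isNull())     # rows where age is null
df.filter(F.col("age").isNotNull())  # rows where age is present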
Fill null values
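Per-column replacement values via fillna; df and the names are assumptions:
df = df.fillna({"age": 0, "name": "unknown"})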
Read/write parquet files
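A sketch; the path is a placeholder:
df.write.mode("overwrite").parquet("/tmp/people.parquet")
df2 = spark.read.parquet("/tmp/people.parquet")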
Stratified sample
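A sketch with sampleBy; the strata column label and the fractions are assumptions:
fractions = {0: 0.1, 1: 0.5}  # sampling fraction per value of the strata column
sample = df.sampleBy("label", fractions=fractions, seed=42)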
Transform to dict
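One dict per row; note that collect() pulls all rows to the driver:
rows = [row.asDict() for row in df.collect()]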
Filter columns with/without nulls
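One way to split columns by whether they contain nulls (df is an assumption):
from pyspark.sql import functions as F
null_counts = df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).first().asDict()
cols_with_nulls = [c for c, n in null_counts.items() if n > 0]
cols_without_nulls = [c for c, n in null_counts.items() if n == 0]
df_no_null_cols = df.select(cols_without_nulls)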
Group by
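A sketch; df, department, and salary are assumed names:
from pyspark.sql import functions as F
df.groupBy("department").agg(
    F.count("*").alias("n"),
    F.avg("salary").alias("avg_salary"))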
Group by column A keep the rows that are max according to column B
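A window-function sketch, using A and B as in the heading:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = Window.partitionBy("A").orderBy(F.col("B").desc())
best = (df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn"))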
Group by (count nulls)
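Count nulls per group (df, department, and salary are assumptions):
from pyspark.sql import functions as F
df.groupBy("department").agg(
    F.sum(F.col("salary").isNull().cast("int")).alias("null_salaries"))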
Order by
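A sketch (df, age, and name are assumed names):
from pyspark.sql import functions as F
df.orderBy(F.col("age").desc(), "name")  # descending age, then ascending name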
Alias
This helps us complement the withColumn functionality in cases where the computation is the result of a Spark transformation.
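For example (df and age are assumed names):
from pyspark.sql import functions as F
df.select("*", (F.col("age") * 12).alias("age_months"))  # like withColumn, inside a select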
Null processing
Remove all rows with at least one null value
Keep only rows with at least X non-null values
Check only given columns
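The three variants above in one sketch (df, the threshold, and the column list are assumptions):
df.dropna(how="any")                  # drop rows containing any null
df.dropna(thresh=3)                   # keep rows with at least 3 non-null values
df.dropna(how="any", subset=["age"])  # consider only the listed columns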
Fill na values
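The na.fill form, equivalent to fillna (names are assumptions):
df.na.fill(0)                           # fill all numeric columns with 0
df.na.fill("unknown", subset=["name"])  # fill only the listed columns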
Example: classification
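A minimal end-to-end sketch with pyspark.ml; the toy data, column names, and settings are all assumptions, not a recommended setup:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
data = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (5.0, 6.0, 1.0), (6.0, 5.0, 1.0)],
    ["x1", "x2", "label"])
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(data)
model = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10).fit(features)
model.transform(features).select("label", "prediction").show()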
NLP tools
Tokenization
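A minimal sketch with pyspark.ml.feature.Tokenizer; the sample sentences are an assumption:
from pyspark.ml.feature import Tokenizer
sentences = spark.createDataFrame(
    [(0, "Spark is great"), (1, "PySpark makes it easy")], ["id", "text"])
tokens = Tokenizer(inputCol="text", outputCol="words").transform(sentences)
tokens.show(truncate=False)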
Stop words
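Continuing from the tokens DataFrame of the tokenization sketch above:
from pyspark.ml.feature import StopWordsRemover
filtered = StopWordsRemover(inputCol="words", outputCol="filtered").transform(tokens)
filtered.show(truncate=False)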
TFIDF
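Continuing from the filtered DataFrame above; numFeatures is an arbitrary choice:
from pyspark.ml.feature import HashingTF, IDF
tf = HashingTF(inputCol="filtered", outputCol="raw_features", numFeatures=1024).transform(filtered)
tfidf = IDF(inputCol="raw_features", outputCol="tfidf").fit(tf).transform(tf)
tfidf.select("tfidf").show(truncate=False)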