
Save & Load SparkML Model

Objectives


In this section we will:

  • Create a simple Linear Regression Model
  • Save the SparkML model
  • Load the SparkML model
  • Make predictions using the loaded SparkML model

Setup


# if not installed already
pip install pyspark
pip install findspark

# Import general spark libraries
import findspark

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Import Spark ML libraries
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

findspark.init()

Create Spark Session & Context

# Create a Spark context
sc = SparkContext()

# Create a spark session
spark = SparkSession \
    .builder \
    .appName("Saving and Loading a SparkML Model").getOrCreate()
    
    
# Check context
sc
# __ OUTPUT
SparkContext
Spark UI
Version: v2.4.3
Master: local[*]
AppName: pyspark-shell
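
Note: because the SparkContext was created before the session, the app name shown above is still pyspark-shell. A more idiomatic sketch (assuming you don't need to construct the context yourself) is to build the session first and pull the context from it:

# Build the session first, then reuse its context
spark = SparkSession \
    .builder \
    .appName("Saving and Loading a SparkML Model") \
    .getOrCreate()
sc = spark.sparkContext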

Analyze


Create DF with Sample Data

# Create a simple dataset of infant height (cm) vs. weight (kg).
mydata = [[46,2.5],[51,3.4],[54,4.4],[57,5.1],[60,5.6],[61,6.1],[63,6.4]]

# Column names for the dataframe
columns = ["height", "weight"]

# Create the dataframe
mydf = spark.createDataFrame(mydata, columns)

# Show the dataframe
mydf.show()

+------+------+
|height|weight|
+------+------+
|    46|   2.5|
|    51|   3.4|
|    54|   4.4|
|    57|   5.1|
|    60|   5.6|
|    61|   6.1|
|    63|   6.4|
+------+------+

Convert Columns to Feature Vectors

We use the VectorAssembler() function to convert the dataframe columns into feature vectors.

  • For our example, we use the infant's height as the input feature and
  • the weight as the target label.
assembler = VectorAssembler(
    inputCols=["height"],
    outputCol="features")

data = assembler.transform(mydf).select('features','weight')

data.show()
+--------+------+
|features|weight|
+--------+------+
|  [46.0]|   2.5|
|  [51.0]|   3.4|
|  [54.0]|   4.4|
|  [57.0]|   5.1|
|  [60.0]|   5.6|
|  [61.0]|   6.1|
|  [63.0]|   6.4|
+--------+------+
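
To confirm that the assembled features column is indeed a vector type, you can inspect the schema (an optional check, not required for the rest of the example):

# The features column should appear as a vector type
data.printSchema()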

Create & Train LR Model

# Create a LR model
lr = LinearRegression(featuresCol='features', labelCol='weight', maxIter=100)
lr.setRegParam(0.1)
# Fit the model
lrModel = lr.fit(data)
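
After fitting, you can inspect the learned parameters and the training summary; these attributes are part of the standard LinearRegressionModel API:

# Slope, intercept, and goodness of fit of the trained model
print(lrModel.coefficients)
print(lrModel.intercept)
print(lrModel.summary.r2)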

Save LR Model

lrModel.save('infantheight2.model')
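
Note that save() raises an error if the path already exists. If you plan to rerun this section, one option (using the standard MLWriter API) is to overwrite explicitly:

# Overwrite an existing saved model at the same path
lrModel.write().overwrite().save('infantheight2.model')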

Load LR Model

# You need LinearRegressionModel to load the model
from pyspark.ml.regression import LinearRegressionModel

model = LinearRegressionModel.load('infantheight2.model')
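
As an optional sanity check, the loaded model should carry the same learned parameters as the model you just trained:

# Compare with lrModel.coefficients and lrModel.intercept above
print(model.coefficients)
print(model.intercept)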

Make Predictions


1- Predict W from H

Predict the weight of an infant whose height is 70 cm.

Create Function

# This function converts a scalar height (cm) into a one-row dataframe
# that the loaded model can use to predict the corresponding weight.
def predict(height):
    assembler = VectorAssembler(inputCols=["height"], outputCol="features")
    data = [[height]]
    columns = ["height"]
    df = spark.createDataFrame(data, columns)
    features = assembler.transform(df).select('features')
    predictions = model.transform(features)
    predictions.select('prediction').show()

Predict

predict(70)

+-----------------+
|       prediction|
+-----------------+
|7.863454719775907|
+-----------------+

2- Predict W from H

Let's save the model under another name and predict the weight for another height, 50 cm.

Save Model Under a New Name

  • We just call save again with a new name, which writes a second copy of the model
lrModel.save('babyweightprediction.model')
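
Saving does not remove the original; each call to save() writes a separate directory (with metadata and data subfolders) at the given path. Assuming a local filesystem, you could verify this with a quick listing:

import os

# Each saved model is a directory containing 'metadata' and 'data'
print(os.listdir('infantheight2.model'))
print(os.listdir('babyweightprediction.model'))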

Load Model

model = LinearRegressionModel.load('babyweightprediction.model')

Predict W

predict(50)

+------------------+
|        prediction|
+------------------+
|3.4666826711164465|
+------------------+
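
When you are done, stopping the session releases the driver resources (a good habit in local notebooks):

# Release Spark resources
spark.stop()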