
Save & Load SparkML Model

Objectives


In this section we will:

  • Create a simple Linear Regression Model
  • Save the SparkML model
  • Load the SparkML model
  • Make predictions using the loaded SparkML model

Setup


# if not installed already
pip install pyspark
pip install findspark

# Import general spark libraries
import findspark

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

# Import Spark ML libraries
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

findspark.init()

Create Spark Session & Context

# Create a Spark context
sc = SparkContext()

# Create a spark session
spark = SparkSession \
    .builder \
    .appName("Saving and Loading a SparkML Model").getOrCreate()
    
    
# Check context
sc
# __ OUTPUT
SparkContext
Spark UI
Version: v2.4.3
Master: local[*]
AppName: pyspark-shell
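
Note: because the SparkContext was created before the session, the app name shown above is still pyspark-shell. A more idiomatic sketch (assuming you don't need to construct the context yourself) is to build the session first and pull the context from it:

# Build the session first, then reuse its context
spark = SparkSession \
    .builder \
    .appName("Saving and Loading a SparkML Model") \
    .getOrCreate()
sc = spark.sparkContext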

Analyze


Create DF with Sample Data

# Create a simple dataset of infant height (cm) vs. weight (kg).
mydata = [[46,2.5],[51,3.4],[54,4.4],[57,5.1],[60,5.6],[61,6.1],[63,6.4]]

# Column names for the dataframe
columns = ["height", "weight"]

# Create the dataframe
mydf = spark.createDataFrame(mydata, columns)

# Show the dataframe
mydf.show()

+------+------+
|height|weight|
+------+------+
|    46|   2.5|
|    51|   3.4|
|    54|   4.4|
|    57|   5.1|
|    60|   5.6|
|    61|   6.1|
|    63|   6.4|
+------+------+

Convert Columns to Feature Vectors

We use the VectorAssembler() function to convert the dataframe columns into feature vectors.

  • For our example, we use the infant's height as the input feature and
  • the weight as the target label.
assembler = VectorAssembler(
    inputCols=["height"],
    outputCol="features")

data = assembler.transform(mydf).select('features','weight')

data.show()
+--------+------+
|features|weight|
+--------+------+
|  [46.0]|   2.5|
|  [51.0]|   3.4|
|  [54.0]|   4.4|
|  [57.0]|   5.1|
|  [60.0]|   5.6|
|  [61.0]|   6.1|
|  [63.0]|   6.4|
+--------+------+
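
To confirm that the assembled features column is indeed a vector type, you can inspect the schema (an optional check, not required for the rest of the example):

# The features column should appear as a vector type
data.printSchema()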

Create & Train LR Model

# Create a LR model
lr = LinearRegression(featuresCol='features', labelCol='weight', maxIter=100)
lr.setRegParam(0.1)
# Fit the model
lrModel = lr.fit(data)
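
After fitting, you can inspect the learned parameters and the training summary; these attributes are part of the standard LinearRegressionModel API:

# Slope, intercept, and goodness of fit of the trained model
print(lrModel.coefficients)
print(lrModel.intercept)
print(lrModel.summary.r2)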

Save LR Model

lrModel.save('infantheight2.model')
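
Note that save() raises an error if the path already exists. If you plan to rerun this section, one option (using the standard MLWriter API) is to overwrite explicitly:

# Overwrite an existing saved model at the same path
lrModel.write().overwrite().save('infantheight2.model')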

Load LR Model

# You need LinearRegressionModel to load the model
from pyspark.ml.regression import LinearRegressionModel

model = LinearRegressionModel.load('infantheight2.model')
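
As an optional sanity check, the loaded model should carry the same learned parameters as the model you just trained:

# Compare with lrModel.coefficients and lrModel.intercept above
print(model.coefficients)
print(model.intercept)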

Make Predictions


1- Predict W from H

Predict the weight of an infant whose height is 70 cm.

Create Function

# This function converts a scalar height (cm) into a one-row dataframe
# that the loaded model can use to predict the corresponding weight.
def predict(height):
    assembler = VectorAssembler(inputCols=["height"], outputCol="features")
    data = [[height]]
    columns = ["height"]
    df = spark.createDataFrame(data, columns)
    features = assembler.transform(df).select('features')
    predictions = model.transform(features)
    predictions.select('prediction').show()

Predict

predict(70)

+-----------------+
|       prediction|
+-----------------+
|7.863454719775907|
+-----------------+

2- Predict W from H

Let's save the model under another name and predict the weight for another height, 50 cm.

Save Model Under a New Name

  • We just call save again with a new name, which writes a second copy of the model
lrModel.save('babyweightprediction.model')
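
Saving does not remove the original; each call to save() writes a separate directory (with metadata and data subfolders) at the given path. Assuming a local filesystem, you could verify this with a quick listing:

import os

# Each saved model is a directory containing 'metadata' and 'data'
print(os.listdir('infantheight2.model'))
print(os.listdir('babyweightprediction.model'))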

Load Model

model = LinearRegressionModel.load('babyweightprediction.model')

Predict W

predict(50)

+------------------+
|        prediction|
+------------------+
|3.4666826711164465|
+------------------+
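
When you are done, stopping the session releases the driver resources (a good habit in local notebooks):

# Release Spark resources
spark.stop()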