Overview


A real estate company that manages properties around Iowa, United States, wishes to improve its methods for pricing homes. Data is readily available on a number of measures, including the size of the home and lot, the year it was built, the number of rooms and the zone where it is located.


Problem Statement


The major problem faced by sellers in the real estate industry is setting the correct price for a property. Properties situated in the same area might have different prices. Buyers face a similar problem, as they often do not understand which factors determine these prices.


Our Approach


To solve the problem, we use INTELLIHUB. Let's see how INTELLIHUB works:

  • First, the user uploads the train and test data.
  • The data is stored in the cloud.
  • Data from the cloud then goes to ML (a Machine Learning module in INTELLIHUB). In ML, you have 3 ML libraries (Weka, H2O & Scikit) to choose from.
  • Using ML, you train a model.
  • You can evaluate the model using RMSE, accuracy and other metrics.
  • You can then use the trained model for prediction.


Data Definition


The data set contains information about 1461 houses situated in Iowa. It has 11 columns. The variables in the data set are:

  • Id: Unique house ID
  • ZoneType: Identifies the general zoning classification of the sale.
  • LotArea: Lot size in square feet
  • YearBuilt: Original construction year.
  • YearRemodAdd: Remodel year (same as construction year if no remodeling or additions).
  • TotalBsmtSF: Total square feet of basement area.
  • AboveGrLiveAr: Above grade (ground) living area square feet.
  • TotalBathroom: Total number of bathrooms.
  • TotalRooms: Total number of rooms available.
  • ParkingSpace: Number of parking spaces available.
  • SalePrice: Selling price of the property. (Label)

Intellihub gives the flexibility of choosing up to 20 columns as features.
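Before uploading a file, it can help to confirm that it contains the columns listed above. The following is a minimal sketch, assuming pandas is installed. The helper `missing_columns` and the two sample rows are hypothetical and only serve as an illustration; they are not part of the Intellihub SDK or the real dataset.

```python
import io

import pandas as pd

# Column names taken from the data definition above.
EXPECTED_COLUMNS = [
    "Id", "ZoneType", "LotArea", "YearBuilt", "YearRemodAdd", "TotalBsmtSF",
    "AboveGrLiveAr", "TotalBathroom", "TotalRooms", "ParkingSpace", "SalePrice",
]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [name for name in expected if name not in df.columns]


# Two illustrative rows standing in for the real training file.
sample_csv = io.StringIO(
    "Id,ZoneType,LotArea,YearBuilt,YearRemodAdd,TotalBsmtSF,"
    "AboveGrLiveAr,TotalBathroom,TotalRooms,ParkingSpace,SalePrice\n"
    "1,RL,8000,2001,2001,900,1700,3,8,2,200000\n"
    "2,RL,9500,1975,1990,1200,1300,2,6,2,180000\n"
)
df = pd.read_csv(sample_csv)

print(missing_columns(df))                                # nothing missing
print(missing_columns(df.drop(columns=["SalePrice"])))    # label column missing
```

In practice you would pass the path of your own training file to `pd.read_csv` instead of the in-memory sample.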



Model Building


This problem involves one response variable, the selling price of the home, and various potential predictors of the selling price. A few of the predictor variables measure, directly or indirectly, the size of the house: square feet, number of rooms and number of bathrooms. We will use multiple linear regression (MLR) to determine a mathematical relationship between these variables. In other words, MLR examines how multiple independent variables are related to one dependent variable.
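To make the idea of multiple linear regression concrete, here is a small sketch that fits price = b0 + b1 * area + b2 * bathrooms by ordinary least squares with NumPy. The data is made up and generated from known coefficients, so the fit should recover them approximately. This illustrates the technique only; it is not how Intellihub or Weka implements LinearRegression.

```python
import numpy as np

# Made-up data: living area (sq ft) and number of bathrooms for five houses.
X = np.array([
    [1000.0, 1.0],
    [1500.0, 2.0],
    [2000.0, 2.0],
    [2500.0, 3.0],
    [3000.0, 2.0],
])

# Prices generated exactly from: price = 50000 + 100 * area + 5000 * bathrooms
y = 50000.0 + 100.0 * X[:, 0] + 5000.0 * X[:, 1]

# Add a column of ones so the model can learn an intercept term.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: coefficients that minimise the squared error.
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, per_sqft, per_bathroom = coef

print(intercept, per_sqft, per_bathroom)
```

Each coefficient reads as the change in predicted price for a one-unit change in that predictor while the others are held fixed.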

While modeling with ML, we will use the following terms:

  • Training Data: The data you use for training your model. It contains all the information you have collected about the problem statement.
  • Test Data: The data you use for testing the model. You can make predictions on this data.
  • Features: The columns in your dataset that you use for training your model.
  • Class/Label: The column whose value you want to predict (here, SalePrice).
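The terms above can be sketched with pandas: pick the feature columns and the label column, then hold out part of the data for testing. The records below are made up, and the 80/20 split only illustrates the concept; how Intellihub performs its own split internally is not described here.

```python
import pandas as pd

# Made-up records; column names follow the data definition.
data = pd.DataFrame({
    "Id": range(1, 11),
    "TotalRooms": [5, 6, 7, 8, 6, 5, 9, 7, 6, 8],
    "TotalBathroom": [1, 2, 2, 3, 2, 1, 3, 2, 2, 3],
    "SalePrice": [120000, 150000, 175000, 210000, 155000,
                  118000, 250000, 180000, 149000, 205000],
})

feature_columns = ["TotalRooms", "TotalBathroom"]  # inputs to the model
label_column = "SalePrice"                         # value to predict

# Shuffle the rows, then keep 80% for training and hold out 20% for testing.
shuffled = data.sample(frac=1.0, random_state=0)
n_train = int(len(shuffled) * 0.8)
train_part = shuffled.iloc[:n_train]
holdout_part = shuffled.iloc[n_train:]

X_train, y_train = train_part[feature_columns], train_part[label_column]
print(len(train_part), len(holdout_part))
```

Note that Id is left out of the features: it identifies a record but carries no information about the price.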

First, we need to create an App from the console and enable the API for IntelliHubML. In Intellihub, we can build a model:

  • Using SDK (For Developer)


Using SDK


If you want to develop a model using the SDK, copy the API key from the INTELLIHUB Console.




Connect To IntelliHub

Description

You can access the services by enabling the API for ML. INTELLIHUB provides IntellihubClient, to which you pass your API key as an argument.

Code


import intellihub

c = intellihub.IntellihubClient("YOUR API KEY")
 

import com.spotflock.IntellihubClient;

IntellihubClient c = new IntellihubClient("YOUR API KEY");
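Rather than pasting the key into source code, you may prefer to read it from the environment. The sketch below is one way to do that. The helper `load_api_key` and the variable name `INTELLIHUB_API_KEY` are assumptions made for this example; they are not defined by the Intellihub SDK.

```python
import os


def load_api_key(env=None, var_name="INTELLIHUB_API_KEY"):
    """Return the API key stored in an environment variable.

    INTELLIHUB_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} to the API key copied from the console")
    return key


# Usage with the client shown above (requires the intellihub package):
# import intellihub
# c = intellihub.IntellihubClient(load_api_key())
```

Keeping the key out of the source file makes it less likely to end up in version control.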
 

Upload Train And Test Files

Description

As Intellihub is a cloud platform, it stores the train and test files remotely. The file upload API returns the file's storage location in Cloud Storage in its response.

Upload Train File


train_file_store_response = c.store("path/to/train/file")

train_data = train_file_store_response["fileUrl"]
 

JSONObject train_file_store_response = c.store("path/to/train/file");
// fileUrl is a string holding the storage location of the uploaded file
String trainData = train_file_store_response.getString("fileUrl");
System.out.println(trainData);
 

train_data file url

'/spotflock-studio-prod/xxxxx@xxxxxxxx.com/1551936734455-Housing_Train.csv'

Upload Test File


test_file_store_response = c.store("path/to/test/file")

test_data = test_file_store_response["fileUrl"]
 

JSONObject test_file_store_response = c.store("path/to/test/file");
// fileUrl is a string holding the storage location of the uploaded file
String testData = test_file_store_response.getString("fileUrl");
System.out.println(testData);
 

test_data file url

'/spotflock-studio-prod/xxxxx@xxxxxxx.com/1551936725437-Housing_Test.csv'

Regression Model

Description

This API lets you train a regression model. Training takes some time, so the job status has to be checked. Once the job is completed, the job output API gives you the model info.

Arguments

  • lib: Library used to train the model. Spotflock and weka are currently supported.
  • service: Valid values are classification and regression.
  • model_name: Name under which the model is saved.
  • algorithm: Algorithm used to train the model.
  • dataset_url: Location of the train dataset file in INTELLIHUB storage.
  • label: Name of the label column in the train dataset file.
  • train_percentage: Percentage of the data used for training; the model is tested against the remaining data.
  • features: List of column names used to train the regression model.
  • save_model: If True, the model is saved.

Code


train_response = c.train("regression", "LinearRegression", train_data, "SalePrice",
                         ["YearBuilt", "YearRemodAdd", "TotalBsmtSF", "AboveGrLiveAr",
                          "TotalBathroom", "TotalRooms", "ParkingSpace"],
                         "Housing - Linear Regression", "weka", 80, True)
 

train_response

{'code': 0,
 'data': {'jobId': 435,
  'appId': 1555944250593,
  'name': 'weka_regression_train',
  'library': 'weka',
  'service': 'Regression',
  'task': 'TRAIN',
  'state': 'RUN',
  'startTime': '2019-04-29T04:12:46.090+0000',
  'endTime': None,
  'request': {'library': 'weka',
   'config': {'datasetUrl': '/spotflock-studio/xxxxx@gmail.com/1556511156024-Housing_Train.csv',
    'algorithm': 'LinearRegression',
    'saveModel': True,
    'label': 'SalePrice',
    'features': ['YearBuilt',
     'YearRemodAdd',
     'TotalBsmtSF',
     'AboveGrLiveAr',
     'TotalBathroom',
     'TotalRooms',
     'ParkingSpace'],
    'name': 'Housing - Linear Regression',
    'trainPercentage': 80,
    'params': {}}}}}

Arguments

  • lib: Library used to train the model. Spotflock and weka are currently supported.
  • service: Valid values are classification and regression.
  • modelName: Name under which the model is saved.
  • algorithm: Algorithm used to train the model.
  • datasetUrl: Location of the train dataset file in INTELLIHUB storage.
  • label: Name of the label column in the train dataset file.
  • trainPercentage: Percentage of the data used for training; the model is tested against the remaining data.
  • features: List of column names used to train the regression model.
  • saveModel: If true, the model is saved.

Code


JSONArray features = new JSONArray();
features.put("YearBuilt");
features.put("YearRemodAdd");
features.put("TotalBsmtSF");
features.put("AboveGrLiveAr");
features.put("TotalBathroom");
features.put("TotalRooms");
features.put("ParkingSpace");


JSONObject params = new JSONObject();
params.put("lib", "weka");
params.put("saveModel", true);
params.put("trainPercentage", 80);
params.put("modelName", "Housing - Linear Regression");

// The label must match the column name in the dataset: "SalePrice"
String response = c.train("regression", "LinearRegression", trainData,
        "SalePrice", features, params);
JSONObject trainResponse = new JSONObject(response);

trainResponse


{"code": 0,
 "data": {"jobId": 436,
  "appId": 1555944250593,
  "name": "weka_regression_train",
  "library": "weka",
  "service": "Regression",
  "task": "TRAIN",
  "state": "RUN",
  "startTime": "2019-04-29T04:12:46.090+0000",
  "endTime": null,
  "request": {"library": "weka",
   "config": {"datasetUrl": "/spotflock-studio/xxxxx@gmail.com/1556511156024-Housing_Train.csv",
    "algorithm": "LinearRegression",
    "saveModel": true,
    "label": "SalePrice",
    "features": ["YearBuilt",
     "YearRemodAdd",
     "TotalBsmtSF",
     "AboveGrLiveAr",
     "TotalBathroom",
     "TotalRooms",
     "ParkingSpace"],
    "name": "Housing - Linear Regression",
    "trainPercentage": 80,
    "params": {}}}}}

Get Train Job Status

Description

Train and predict jobs take some time to complete; use this API to check their status.

Code


train_job_status_response = c.job_status(train_response["data"]["jobId"])
 

JSONObject trainJobStatusResponse = c.jobStatus(trainResponse.getJSONObject("data")
        .get("jobId"));
System.out.println(trainJobStatusResponse.toString());

train_job_status_response


{
  "jobId":438,
  "appId":1555944250593,
  "name":"weka_regression_train",
  "library":"weka",
  "service":"Regression",
  "task":"TRAIN",
  "state":"FINISH",
  "startTime":"2019-04-29T05:55:00.499+0000",
  "endTime":"2019-04-29T05:55:02.234+0000"
}
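Because a job can stay in the RUN state for a while, the status call is usually repeated until the state changes. Below is a minimal polling sketch. `wait_for_job` is a hypothetical helper, not part of the SDK. It assumes that `job_status` returns a dict shaped like the response above, and that any state other than RUN means the job has stopped; only RUN and FINISH appear in this tutorial.

```python
import time


def wait_for_job(client, job_id, poll_seconds=2.0, timeout_seconds=300.0):
    """Poll client.job_status(job_id) until the job leaves the RUN state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.job_status(job_id)
        # Use the nested "data" object if the response has one; otherwise
        # read "state" from the top level, as in the sample response.
        info = status.get("data", status)
        if info.get("state") != "RUN":
            return info
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Usage with the client and train response from the earlier steps:
# final = wait_for_job(c, train_response["data"]["jobId"])
# if final["state"] == "FINISH":
#     train_job_output_response = c.job_output(final["jobId"])
```

The timeout guards against waiting forever if a job never finishes; adjust both intervals to suit your data size.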

Get Train Job Output

Description

Once the job state is FINISH, the job output can be retrieved with this API.

Code


train_job_output_response = c.job_output(train_response["data"]["jobId"])
 

JSONObject trainJobOutputResponse = c.jobOutput(trainResponse.getJSONObject("data")
        .get("jobId"));
System.out.println(trainJobOutputResponse.toString());

Evaluation Metrics

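  • errorRate: For a classification model, the ratio of incorrectly predicted instances to the total number of instances. For a numeric label such as SalePrice, Weka reports the root mean squared error under this name, which is consistent with the magnitude of the value in the response below (expressed in the units of SalePrice).
  • pearsonCorrelation: The Pearson correlation between each feature and the label. Values closer to 1 or -1 indicate a stronger linear relationship.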
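The approach section mentions RMSE as an evaluation metric, and the job output reports a Pearson correlation for each feature. The sketch below shows how these two quantities are commonly computed with NumPy. The numbers are made up for illustration, and this is not Intellihub's internal implementation.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the label."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


# Made-up sale prices and predictions, for illustration only.
actual = [200000, 150000, 300000, 250000]
predicted = [210000, 140000, 290000, 260000]
print(rmse(actual, predicted))        # every prediction is off by 10000

living_area = [1500, 1100, 2400, 2000]
print(pearson(living_area, actual))   # close to 1: price rises with area
```

A lower RMSE means predictions sit closer to the actual prices; a correlation near 0 suggests a feature has little linear relationship with the label.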

Response

{'id': 382,
 'jobId': 435,
 'output': {'eval': {'errorRate': 37553.968063766835,
   'pearsonCorrelation': {'YearBuilt': 0.5217288200486135,
    'TotalRooms': 0.5304982050516467,
    'TotalBsmtSF': 0.6178120742263836,
    'ParkingSpace': 0.6422093932720816,
    'YearRemodAdd': 0.5115401835644273,
    'AboveGrLiveAr': 0.6995247520043848,
    'TotalBathroom': 0.6075329437393489}},
  'modelUrl': '/spotflock-studio-prod/22/1556511166745-Housing_-_Linear_Regression_5172345025555246511.mdl'}}
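To see at a glance which features relate most strongly to the label, the correlations can be sorted. The sketch below copies the values from the response above into a plain dict; with a live response you would read them from `train_job_output_response["output"]["eval"]["pearsonCorrelation"]` instead.

```python
# The 'pearsonCorrelation' values copied from the response above.
pearson_correlation = {
    "YearBuilt": 0.5217288200486135,
    "TotalRooms": 0.5304982050516467,
    "TotalBsmtSF": 0.6178120742263836,
    "ParkingSpace": 0.6422093932720816,
    "YearRemodAdd": 0.5115401835644273,
    "AboveGrLiveAr": 0.6995247520043848,
    "TotalBathroom": 0.6075329437393489,
}

# Sort features from the strongest to the weakest correlation with SalePrice.
ranked = sorted(pearson_correlation.items(), key=lambda item: item[1], reverse=True)

for name, value in ranked:
    print(f"{name:<14} {value:.3f}")
```

In this run, AboveGrLiveAr has the highest correlation with the sale price and YearRemodAdd the lowest.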

Get Model Url

Description

After the train job is finished, you can get the model url.

Code


model = train_job_output_response["output"]["modelUrl"]
 

String model = trainJobOutputResponse.getJSONObject("output").getString("modelUrl");
 

Printing model gives:

'/spotflock-studio-prod/22/1552024436737-Sales_Data_Model_6647825745227784853.mdl'

Predict on Test Data

Description

The code below predicts on the test data by passing the model url obtained from the previous response.

Code


predict_response = c.predict("weka", "regression", test_data, model)
 

predict_response


params = new JSONObject();
params.put("lib","weka");
JSONObject predictResponse = c.predict("regression", testData, model, params);
 

predictResponse

{'code': 0,
 'data': {'jobId': 436,
  'appId': 1555944250593,
  'name': 'weka_regression_predict',
  'library': 'weka',
  'service': 'Regression',
  'task': 'PREDICT',
  'state': 'RUN',
  'startTime': '2019-04-29T04:13:33.187+0000',
  'endTime': None,
  'request': {'library': 'weka',
   'config': {'modelUrl': '/spotflock-studio-prod/22/1556511166745-Housing_-_Linear_Regression_5172345025555246511.mdl',
    'params': {},
    'datasetUrl': '/spotflock-studio-prod/anurag@spotflock.com/1556511162346-Housing_Test.csv'}}}}

Get Prediction Job Status

Description

Train and predict jobs take some time to complete; use this API to check their status.

Code


predict_job_status_response = c.job_status(predict_response["data"]["jobId"])
 

predict_job_status_response


JSONObject predictJobStatusResponse = c.jobStatus(predictResponse.getJSONObject("data")
.get("jobId"));

predictJobStatusResponse


{
  "jobId":436,
  "appId":1555944250593,
  "name":"weka_regression_predict",
  "library":"weka",
  "service":"Regression",
  "task":"PREDICT",
  "state":"FINISH",
  "startTime":"2019-04-29T04:13:33.187+0000",
  "endTime":"2019-04-29T04:13:39.239+0000"
}

Get Prediction Job Output

Description

Once the job state is FINISH, the job output can be retrieved with this API.

Code


predict_job_output_response = c.job_output(predict_response["data"]["jobId"])
 

predict_job_output_response


JSONObject predictJobOutputResponse = c.jobOutput(predictResponse.getJSONObject("data")
        .get("jobId"));
System.out.println(predictJobOutputResponse.toString());

predictJobOutputResponse

{'id': 383,
 'jobId': 436,
 'output': {'reqId': 436,
  'predFileUrl': '/spotflock-studio-prod/22/1556511218179-prediction.csv'}}

Get Prediction File Url

Description

Once the Predict job is completed, get the prediction file url.

Code


pred_file = predict_job_output_response['output']['predFileUrl']
 

String pred_file = predictJobOutputResponse.getJSONObject("output").getString("predFileUrl");
 

pred_file

'/spotflock-studio/22/1551864223344-prediction.csv'

Download Prediction File

Description

You can download the prediction file as a CSV using the code below.

Code


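import io
import pandas as pd

# Download the prediction file and parse its CSV text into a DataFrame
prediction_response = c.download(pred_file)
df = pd.read_csv(io.StringIO(prediction_response.text))
# Save a local copy without the extra index column
df.to_csv('pred_file.csv', index=False)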
 

JSONObject predictionResponse = c.download(pred_file);
// Replace the placeholder with your output file path
FileWriter outputfile = new FileWriter("path/to/output/file.csv");
outputfile.write(predictionResponse.toString());
outputfile.close();
 


Summary


  • With this model, we can predict the selling prices of different houses and understand the factors on which these prices depend.
  • This model helps both sellers, who can bargain for a reasonable and fair price, and buyers, who get an idea of how much they need to pay for a given area, number of rooms and location.
  • Studio can also be used to train various other models, such as Logistic, MultilayerPerceptron, NaiveBayesMultinomial, RandomForest, LibSVM, AdaBoostM1, AttributeSelectedClassifier, Bagging, CostSensitiveClassifier, DecisionTable, GaussianProcesses, IBk, RandomTree, SMO.