A Thorough Dive into the Ames Iowa Housing Dataset. Part 1 of 4.

Jesse Villines
11 min read · Feb 16, 2021

Part 1: Dealing with null and non-numeric values.

The Ames Iowa housing dataset is widely used for teaching data science students the fundamentals of regression analysis while providing good practice in cleaning and exploring a dataset.

This post is a step by step walkthrough of preparing the Ames Iowa dataset for machine learning. By the end of this notebook, we will have prepared the dataset for feature engineering by removing all null values, verifying data types, converting ordinal data into numeric types, and performing some preliminary EDA. We will be well on the way to having a dataset that is ready to be processed in the machine learning algorithm of our choice!

Part 2 of this series contains additional EDA, feature engineering, and feature selection for modeling. Part 3 of this series is a walkthrough of the construction of various linear regression models (OLS, Ridge, Lasso, and ElasticNet), implementation of a data pipeline, and grid search to optimize model hyperparameters. Part 4, the final part, will show you how to deploy the chosen model to a Streamlit online application. All code is included, along with the final GitHub repository. The dataset is available in the GitHub repository at the bottom of this post so that the reader can follow along and reproduce this work. Because this deep dive examines every feature in detail, some of it may appear repetitive, but thoroughly examining your dataset is integral to the data science workflow and necessary to answer the chosen problem statement.

For this walk-through, you are a data scientist working for the fictitious Common Good Realty, based out of Ames, Iowa. You are part of a team developing an online Home Buyers Toolbox to help make the home buying process easier for prospective home buyers. This is an important task, as research shows that the average home buyer visits 10 houses over 10 weeks before settling on the right home.[source] It is a massive financial and life decision, as home purchasers stay in a newly purchased home for approximately 13 years.[source] The goal of this project is to reduce the time it takes for home-seekers who work with Common Good Realty to find the right-fit home, thereby increasing company sales, reducing the time our realtors spend with each customer, and increasing the company's referral rate by providing a better home match for each customer's budget.

You are given a data set that includes 2051 records of home sales over the past several decades and has 81 feature columns. Your goal is to create an accurate model with features specific to the online platform. In total, we will end up using 28 feature columns, many of which are engineered to meet the above problem statement, dropping only 13 records and achieving an R² of 92%. The complete data dictionary can be found here. Open up a copy of Jupyter Lab and follow along with the provided code snippets.

The first step is to import all necessary libraries. This should always be done at the top of your notebook.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#set_style
plt.style.use('fivethirtyeight')
#configure for retina display
%config InlineBackend.figure_format = 'retina'

Import the dataset using a relative path and give the data frame a descriptive name.

# Read in data
ames = pd.read_csv('../datasets/train.csv')
#want to change columns names to lower/snake_case so a little easier to work with
ames.columns = ames.columns.str.lower().str.replace(' ', '_')
#increase pandas default max_rows so I can view all columns at once
pd.options.display.max_rows = 250
pd.options.display.max_columns = 250

Then we will take a look at the shape, information, and summary statistics of the dataset.

ames.shape         #81 columns and 2051 records
ames.info()        # There are a lot of columns
ames.describe().T  #Summary statistics

The next step is to deal with null, aka missing, values.

#Investigate nulls
ames.isnull().sum()[0:90]
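With 81 columns, it can help to filter the null summary down to only the columns that actually contain nulls. A minimal sketch on a toy frame (the column names here are placeholders, not columns from the Ames data):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the ames DataFrame
df = pd.DataFrame({'a': [1, np.nan], 'b': [1, 2], 'c': [np.nan, np.nan]})

# Keep only columns with at least one null, sorted worst-first
null_counts = df.isnull().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)  # c: 2, a: 1
```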

Lot Frontage:

ames['lot_frontage'].value_counts().head() #Linear feet of street connected to property
output
# Apply mask/filter to examine only null values.
lot_nulls = ames[ames['lot_frontage'].isna()] #lots with no linear feet of street connected to property
# These appear to be real properties. Setting to the average for the corresponding lot_config.
lot_nulls.head()
output
# Group lots by configuration
grouped_lots = ames.groupby(['lot_config']).agg({
'lot_frontage' : 'mean',
})
grouped_lots
output

I decided to impute the missing values with the mean for that lot configuration type. Two other methods worth examining, though not explored in this notebook, are linear regression and K-means clustering.

#replace np.nan with average for lot_config of that type
ames.loc[(ames['lot_frontage'].isna()) & (ames['lot_config'] == 'Corner'), 'lot_frontage'] = 83.25
ames.loc[(ames['lot_frontage'].isna()) & (ames['lot_config'] == 'CulDSac'), 'lot_frontage'] = 54.734375
ames.loc[(ames['lot_frontage'].isna()) & (ames['lot_config'] == 'FR2'), 'lot_frontage'] = 60.836735
ames.loc[(ames['lot_frontage'].isna()) & (ames['lot_config'] == 'FR3'), 'lot_frontage'] = 87.000000
ames.loc[(ames['lot_frontage'].isna()) & (ames['lot_config'] == 'Inside'), 'lot_frontage'] = 66.759571
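The five hard-coded fills above can also be collapsed into a single groupby-transform, which recomputes the group means automatically if the data changes. A sketch on a toy frame (the means here are illustrative, not the real Ames values):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'lot_config':   ['Corner', 'Corner', 'Inside', 'Inside'],
    'lot_frontage': [80.0, np.nan, 60.0, np.nan],
})

# Fill each null with the mean frontage of its lot_config group
group_means = df.groupby('lot_config')['lot_frontage'].transform('mean')
df['lot_frontage'] = df['lot_frontage'].fillna(group_means)
print(df['lot_frontage'].tolist())  # [80.0, 80.0, 60.0, 60.0]
```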

Check to make sure all null values are taken care of:

ames['lot_frontage'].isna().sum()

Look for outliers:

# looking at outliers
ames['lot_frontage'].sort_values().tail()

Make a note of potential outliers to drop:

x_frontage_maybe_drop = ames[ames['lot_frontage'] == 313]
x_frontage_maybe_drop

Then move on to the next column feature, ‘Alley’:

ames['alley'].value_counts()
output
ames['alley'].isnull().sum() #1911 null values

Here is where consulting our data dictionary provides valuable insight. According to the data dictionary, these values should be NA, meaning these houses have no alley. I will one hot encode (OHE)/dummy this variable later on: 1 if the house has an alley, 0 if it does not.

ames['alley'] = ames['alley'].replace(np.nan, 'NA')

A final examination of the feature:

ames['alley'].value_counts()
output

Next, mas_vnr_type & mas_vnr_area also known as Masonry veneer type and area.

print(ames['mas_vnr_type'].value_counts())
print(ames['mas_vnr_area'].value_counts())
ames[ames['mas_vnr_type'].isna()].head()
output

Again, the data dictionary is our friend and we learn that the following code snippet will solve our missing value issue.

ames['mas_vnr_type'] = ames['mas_vnr_type'].replace(np.nan, 'None') #replace masonry np.nan with None
ames['mas_vnr_area'] = ames['mas_vnr_area'].replace(np.nan, 0) #replace masonry np.nan with 0

Next, we examine basement related variables:
* bsmt_qual (Evaluates the height of the basement),
* bsmt_cond (Evaluates the general condition of the basement),
* bsmt_exposure (Refers to walkout or garden level walls)
* bsmtfin_type_1
* bsmtfin_sf_1
* bsmtfin_type_2
* bsmtfin_sf_2
* bsmt_unf_sf
* total_bsmt_sf
* bsmt_full_bath
* bsmt_half_bath

ames['bsmt_qual'].value_counts() #missing NA for no basement
output
ames[['bsmt_qual', 'bsmt_cond', 'bsmt_exposure', 'bsmtfin_type_1', 'bsmtfin_sf_1', 'bsmtfin_type_2', 'bsmtfin_sf_2', 'bsmt_unf_sf', 'total_bsmt_sf','bsmt_full_bath', 'bsmt_half_bath']].isnull().sum()
output
#these all receive an NA if no basement
basement_to_na = ['bsmt_qual', 'bsmt_cond', 'bsmt_exposure', 'bsmtfin_type_1',
                  'bsmtfin_type_2']
#these receive 0 sqft if basement is NA
basement_to_zero = ['bsmtfin_sf_2', 'bsmt_unf_sf', 'total_bsmt_sf', 'bsmtfin_sf_1',
                    'bsmt_full_bath', 'bsmt_half_bath']
for i in basement_to_na:
    ames[i] = ames[i].replace(np.nan, 'NA')
#the missing records are for NA-basement properties, so set sqft and bath counts to zero
ames[ames['bsmtfin_sf_1'].isna()]
for i in basement_to_zero:
    ames[i] = ames[i].replace(np.nan, 0)
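The two loops above can also be expressed as one `fillna` call taking a column-to-value dictionary. A sketch with shortened stand-in lists (the real `basement_to_na` and `basement_to_zero` lists contain more columns):

```python
import pandas as pd
import numpy as np

basement_to_na = ['bsmt_qual', 'bsmt_cond']
basement_to_zero = ['bsmt_unf_sf', 'total_bsmt_sf']

df = pd.DataFrame({
    'bsmt_qual': ['Gd', np.nan],
    'bsmt_cond': ['TA', np.nan],
    'bsmt_unf_sf': [100.0, np.nan],
    'total_bsmt_sf': [500.0, np.nan],
})

# One dict maps each column to its fill value
fills = {col: 'NA' for col in basement_to_na}
fills.update({col: 0 for col in basement_to_zero})
df = df.fillna(fills)
print(df.isnull().sum().sum())  # 0
```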

Next, fireplace_qu or Fireplace Quality.

print(ames['fireplace_qu'].isnull().sum()) # There are 1000 missing values
print(len(ames['fireplace_qu'])) # Remember we only have 2051 records. This is nearly half of the dataset

There are many ordinal values in this dataset. In order to use these features for modeling, we will have to convert this dtype=object to a numeric datatype, but only after dealing with all null values. It is very important to pay attention to datatypes throughout this process. Computers love numbers, but not much else. Everything eventually needs to be reduced to some numeric form. In data science, there are many ways to do this, even for complex text documents where Natural Language Processing (NLP) is applied. This dataset does not require any NLP, but we will have to transform ordinal features to numeric features in order for our machine learning algorithms to run properly.

ames['fireplace_qu'].unique()
output, an example of ordinal values
ames['fireplaces'].unique()
output, an example of discrete numeric values

Confirm that houses without fireplaces are also houses where fireplace quality is a missing value. It is a very good idea to use multiple columns (if possible) to get a better picture as to why information is missing in your dataset. In this case, we confirm that we can replace missing values with NA.

len(ames[(ames['fireplace_qu'].isna()) & (ames['fireplaces'] < 1)]) #number of houses w/o fireplace = fireplace_qu np.nan is 1000!

Perform the replacement:

ames['fireplace_qu'] = ames['fireplace_qu'].replace(np.nan, 'NA') #NA for 'No Fireplace'

Now for garage related feature columns:

* garage_type
* garage_yr_blt
* garage_finish
* garage_cars
* garage_area
* garage_qual
* garage_cond

ames[['garage_type','garage_finish','garage_qual','garage_cond']].isnull().sum()
output
#create two lists, one for categorical and one for numeric
garage_to_na = ['garage_type','garage_finish','garage_qual','garage_cond']
garage_to_0 = ['garage_yr_blt','garage_cars', 'garage_area']

Replace ordinal columns with NA based on data dictionary.

for i in garage_to_na:
    ames[i] = ames[i].replace(np.nan, 'NA')

A closer look at the numeric garage features:

ames[(ames['garage_yr_blt'].isna()) & (ames['garage_type'] == 'NA')] #113 rows, what's up with the missing 1 to make 114?
ames[garage_to_0].isnull().sum()
output
ames[ames['garage_cars'].isna()] #it has a garage type, but no more info. I'm going to set this value to zero. This is an educated guess.

A final check, always a good idea to double-check your work.

ames[['garage_type','garage_finish','garage_qual','garage_cond','garage_yr_blt','garage_cars', 'garage_area']].isnull().sum()
output, all garage nulls dealt with!

Next, pool_qc

ames['pool_qc'].isnull().sum() #mostly empty column. There are only 2051 records and 2042 of these are null values.

There is little information gained from the pool features.

print(ames['pool_qc'].unique()) #almost no information on pool quality
#Will probably end up dropping this column. May be important if pool dummied to YES/NO, but less so for pool quality.
#every null pool_qc also has pool_area == 0, so replace with 'NA' for no pool
ames.loc[(ames['pool_qc'].isna()) & (ames['pool_area'] == 0), 'pool_qc'] = 'NA'

Fence feature:

print(ames['fence'].isnull().sum()) # There are 1651 nulls for this feature
#This replacement should feel like second nature by now.
ames['fence'] = ames['fence'].replace(np.nan, 'NA')
ames['fence'].isnull().sum() #Zero nulls remain

misc_feature:

ames['misc_feature'].isnull().sum() #no surprise most do not have a 'Tennis court' or 'Elevator' on their property... 1986 nulls here
#every null misc_feature also has misc_val == 0, so replace with 'NA'
ames.loc[(ames['misc_feature'].isna()) & (ames['misc_val'] == 0), 'misc_feature'] = 'NA'
ames['misc_feature'].isnull().sum() #no remaining null values

A final check:

ames.isnull().sum()

Next, we check the datatypes against the data dictionary to make sure our data frame is correctly reading the data and create lists of each data type for use in EDA.

ames_indexing = ames[['id', 'pid']]
#ordinal data. Will want to convert these to numeric.
#Note: 'overall_qual' and 'overall_cond' are already numeric
ames_ordinal = ames[['saleprice','lot_shape','utilities','land_slope','overall_qual','overall_cond',
    'exter_qual','exter_cond','bsmt_qual','bsmt_cond','bsmt_exposure','bsmtfin_type_1','bsmtfin_type_2',
    'heating_qc','electrical','kitchen_qual','functional','fireplace_qu','garage_finish','garage_qual',
    'garage_cond','paved_drive','pool_qc','fence']]
ames_nominal = ames[['ms_subclass','ms_zoning','street','alley','land_contour','lot_config','neighborhood',
    'condition_1','condition_2','bldg_type','house_style','roof_style','roof_matl','exterior_1st',
    'exterior_2nd','mas_vnr_type','foundation','heating','central_air','garage_type','misc_feature',
    'sale_type']]
ames_catagorical = ames[['street','alley','lot_shape','land_contour','utilities','lot_config','land_slope',
    'neighborhood','condition_1','condition_2','bldg_type','house_style','roof_style','roof_matl',
    'exterior_1st','exterior_2nd','mas_vnr_type','exter_qual','exter_cond','foundation','bsmt_qual',
    'bsmt_cond','bsmt_exposure','bsmtfin_type_1','bsmtfin_type_2','heating','heating_qc','central_air',
    'electrical','kitchen_qual','functional','fireplace_qu','garage_type','garage_finish','garage_qual',
    'garage_cond','paved_drive','pool_qc','fence','misc_feature','sale_type']]
ames_numeric_catagory = ames[['saleprice','ms_subclass','ms_zoning','overall_qual','overall_cond','mo_sold']]
ames_numeric = ames[['saleprice','lot_frontage','lot_area','year_built','year_remod/add','mas_vnr_area',
    'bsmtfin_sf_2','bsmt_unf_sf','total_bsmt_sf','1st_flr_sf','2nd_flr_sf','low_qual_fin_sf','gr_liv_area',
    'bsmt_full_bath','bsmt_half_bath','full_bath','half_bath','bedroom_abvgr','kitchen_abvgr','totrms_abvgrd',
    'fireplaces','garage_yr_blt','garage_cars','garage_area','wood_deck_sf','open_porch_sf','enclosed_porch',
    '3ssn_porch','screen_porch','pool_area','misc_val','yr_sold']]
y = ames['saleprice'] # our target variable
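With the lists defined, `select_dtypes` gives a quick cross-check that nothing non-numeric slipped into the numeric group. A sketch on a toy frame (only three stand-in columns here):

```python
import pandas as pd

df = pd.DataFrame({
    'saleprice': [200000, 150000],
    'lot_area': [8000, 9500],
    'neighborhood': ['NAmes', 'OldTown'],
})

# Columns pandas parsed as numeric vs. object (string)
numeric_cols = df.select_dtypes(include='number').columns.tolist()
object_cols = df.select_dtypes(include='object').columns.tolist()
print(numeric_cols)  # ['saleprice', 'lot_area']
print(object_cols)   # ['neighborhood']
```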

Now for some initial Exploratory Data Analysis. Here we are looking to visually inspect our variables for insights that may not be readily apparent.

sns.pairplot(ames_numeric_catagory) #initial eda on numeric features
output, what trends stand out to you?

After this initial visual inspection, we see that summer months have more sales, that sales price (our target) has an obvious skew, and that more houses have an average overall condition than any other.
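The skew in the target is often handled with a log transform before fitting linear models. A minimal sketch of the round trip on toy prices (not a step performed in this part of the series; `np.log1p` compresses the right tail and `np.expm1` converts predictions back to dollars):

```python
import numpy as np

prices = np.array([100_000, 150_000, 200_000, 750_000])  # right-skewed toy prices

# log1p compresses the long right tail; expm1 is its exact inverse
log_prices = np.log1p(prices)
recovered = np.expm1(log_prices)
print(np.allclose(recovered, prices))  # True
```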

Let’s transform our ordinal feature columns into numbers so we can take a look at the correlations of these features as well.

#ordinal columns
ames_ordinal.head()
output

I chose a dictionary-replacement method because it is a simple mapping technique that a junior analyst could easily be trained to update if Common Good Realty decides to expand this segment of the Home Buyers Toolbox to additional cities or to refresh the dataset with recent house sales. Values indicating higher quality, or weight, are assigned larger numbers than values indicating lower quality. The numbers here were chosen based on a visual inspection of the distribution of each value within the feature column; a more sophisticated method using the standard deviation and distribution could improve this section.

#ordinal data. Will want to convert these to numeric.
#columns I want to replace values in
ames_ordinal_str_columns = ames[['lot_shape','utilities','land_slope','exter_qual',
'exter_cond','bsmt_qual',
'bsmt_cond', 'bsmt_exposure', 'bsmtfin_type_1','bsmtfin_type_2','heating_qc','electrical','kitchen_qual',
'functional','fireplace_qu','garage_finish','garage_qual','garage_cond','paved_drive','pool_qc','fence']]
#dictionary of values I want to replace
dict_ordinal = {
#overall_qual, overall_cond, exter_qual, exter_cond, bsmt_qual, bsmt_cond, bsmt_exposure,
#heating_qc, kitchen_qual, fireplace_qu, garage_qual, garage_cond, pool_qc, fence
'Ex': 5,
'Gd': 4,
'TA': 3,
'Fa': 2,
'Po': 1,
'NA': 0,
#lot_shape
'Reg':4,
'IR1':3,
'IR2':2,
'IR3':1,
#utilities
'AllPub':4,
'NoSewr':3,
'NoSeWa':2,
'ELO':1,
#land_slope
'Gtl':3,
'Mod':2,
'Sev':1,
#bsmt_exposure
'Av': 3,
'Mn': 2,
'No': 1,
#bsmtfin_type_1, bsmtfin_type_2
'GLQ':6,
'ALQ':5,
'BLQ':4,
'Rec':3,
'LwQ':2,
'Unf':1,
#Electrical
'SBrkr':1,
'FuseA':0,
'FuseF':0,
'FuseP':0,
'Mix':0,
#Functional
'Typ':1,
'Min1':0,
'Min2':0,
'Mod':0,
'Maj1':0,
'Maj2':0,
'Sev':0,
'Sal':0,
#garage_finish
'Fin': 3,
'RFn': 2,
'Unf':1,
#paved_drive
'Y' : 2,
'P' : 1,
'N' : 0,
#fence
'GdPrv':4,
'MnPrv':3,
'GdWo':2,
'MnWw':1
}
#loop to get the job done
for i in ames_ordinal_str_columns.columns:
    ames.replace({i: dict_ordinal}, inplace=True)
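One caveat with a single shared dictionary: Python dict literals keep only the last occurrence of a repeated key, so codes that appear on more than one scale above ('Mod', 'Sev', 'Unf') silently take whichever value was listed last. A per-column mapping avoids the collision. A sketch with two of the scales (the numeric values mirror the dictionary above):

```python
import pandas as pd

df = pd.DataFrame({
    'land_slope': ['Gtl', 'Mod', 'Sev'],
    'functional': ['Typ', 'Mod', 'Sev'],
})

# Separate scale per column, so 'Mod' can mean 2 in one column and 0 in another
column_maps = {
    'land_slope': {'Gtl': 3, 'Mod': 2, 'Sev': 1},
    'functional': {'Typ': 1, 'Mod': 0, 'Sev': 0},
}
for col, mapping in column_maps.items():
    df[col] = df[col].map(mapping)
print(df['land_slope'].tolist())  # [3, 2, 1]
print(df['functional'].tolist())  # [1, 0, 0]
```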

These feature columns should all be numeric now; let's take a look:

#visually inspect that the change took
ames[['lot_shape','utilities','land_slope','exter_qual', 'exter_cond','bsmt_qual','bsmt_cond', 'bsmt_exposure', 'bsmtfin_type_1','bsmtfin_type_2','heating_qc','electrical','kitchen_qual','functional','fireplace_qu','garage_finish','garage_qual','garage_cond','paved_drive','pool_qc','fence']].head(3)
output, exactly what we want!

Visually inspect the datatypes; all should return as int64:

#checking to see if all dtypes now numeric
ames[['lot_shape','utilities','land_slope','exter_qual', 'exter_cond','bsmt_qual','bsmt_cond', 'bsmt_exposure', 'bsmtfin_type_1','bsmtfin_type_2','heating_qc','electrical','kitchen_qual','functional','fireplace_qu','garage_finish','garage_qual','garage_cond','paved_drive','pool_qc','fence']].dtypes

Now let’s do the same for the neighborhood feature. Intuitively this will be an important feature. Location matters when it comes to housing prices.

#I don't want to dummy neighborhood, but I do want to bin it into ordinal values based on mean price
ames['neighborhood_order'] = ames['neighborhood']
ames_ordinal_str_columns = ames[['neighborhood_order']]
dict_ordinal = {
'MeadowV': 1,
'IDOTRR' : 1,
'BrDale' : 1,
'OldTown': 1,
'BrkSide': 1,
'Edwards': 1,
'SWISU': 1,
'Landmrk':2,
'Sawyer':2,
'NPkVill':2,
'Blueste':2,
'NAmes':2,
'Mitchel':2,
'SawyerW':2,
'Greens':3,
'Gilbert':3,
'NWAmes':3,
'Blmngtn':3,
'CollgCr':3,
'Crawfor':3,
'ClearCr':3,
'Somerst':4,
'Timber':4,
'Veenker':4,
'GrnHill':4,
'StoneBr':4,
'NoRidge':4,
'NridgHt':4,
}
#loop to get the job done
for i in ames_ordinal_str_columns.columns:
    ames.replace({i: dict_ordinal}, inplace=True)

Note that I decided to split this into four separate bins, instead of a unique bin for all neighborhoods. This is a part of feature engineering and the portion of data science that is more art than science. It is also the perfect segue into the next part of this series.
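The hand-built bins above could also be derived programmatically by quartile-binning each neighborhood's mean sale price with `pd.qcut`. A sketch on toy data (neighborhood names and prices are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'neighborhood': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'saleprice':    [100, 110, 150, 160, 200, 210, 300, 310],
})

# Rank neighborhoods into 4 ordered bins by mean sale price
means = df.groupby('neighborhood')['saleprice'].mean()
bins = pd.qcut(means, 4, labels=[1, 2, 3, 4]).astype(int)
df['neighborhood_order'] = df['neighborhood'].map(bins)
print(df['neighborhood_order'].tolist())  # [1, 1, 2, 2, 3, 3, 4, 4]
```

This trades the hand-tuned groupings for reproducibility: new sales data re-bins automatically, though the cut points will shift with the data.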

Check out part two of this four-part series where we will continue our thorough dive into the Ames Iowa Dataset with additional EDA, feature engineering, and feature selection for modeling.

Github repository: https://github.com/jesservillines/Housing-Prices

Author’s Linkedin: https://www.linkedin.com/in/jesse-villines/

If you like this work please reach out directly on LinkedIn with questions or ideas for further collaboration!

Jesse Villines

I’m a healthcare data scientist with a passion for mining golden insights from complex datasets.