Ames Housing Data: an introspection

Statty Spice
May 12, 2021
Iowa, probably

My data science cohort made its first foray into the world of Kaggle the other week. Kaggle is an online data science community with members all over the world. Not to be confused with Kegel, a male doctor in the 1940s who invented vagina flexing (finally). Kaggle boasts tons of resources for discussing and learning about data science, but the real coup is that it hosts machine learning and predictive analytics competitions. These are valuable opportunities for novices and experts alike to build skills in a low-risk environment, made infinitely more fun by a public leaderboard, prizes, and secret holdout test data, so the true winner isn't known until the clock runs out.

We used Kaggle to participate in a closed competition with just our class, using a classic training set: the Ames, Iowa housing data. You get 80-ish features of homes in Ames along with their sale prices, and from those you build a model to predict the prices of a list of unpriced homes.

I managed to squeak out a win in the class competition. Is it a big deal? Well, I’m drafting a tattoo about it, but for a discreet place because humility is a virtue. Admittedly it feels a lot like the first time I settled Catan. I won that, too, at a table of practiced Settlers, but mostly because I took inane risks since I didn’t fully understand the rules and was less than sober.

When it came to making sense of the Ames data, I didn't have to reinvent sliced bread. This exact competition had been run on the public forum years before, and mock competitions like ours are used frequently for data science training. The best part about the Kaggle community is that winners just can't shut up: loads of talented data scientists post notebooks and tutorials documenting the theory and code behind their methods. Five stars. If I'm lucky, this bloject may be that useful to somebody one day. (Not today)

I have no wisdom to share about modeling housing data that can’t be found in some kickass ditties by P. Marcelino, J. Solal, and A. Papiu. (Seriously, check them out). What I can offer is a brief explanation of why these data scientists employed the strategies that they did, in three instances:

1. Transform non-normal dependent variable

Why transform? Start with the distribution of our target values (what we're trying to predict, here SalePrice). It is clearly not normal, with a distinct right skew. In a right-skewed distribution the long tail drags the central tendencies apart, typically mean > median > mode. So what? Well, your central tendencies are a little out of whack, and that can lead to bias. The tail extending to the right acts like a set of outliers, and outliers aren't great for regression.
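
If you want to see the skew for yourself, pandas will report it directly. A minimal sketch, assuming the Kaggle training file has been saved locally as train.csv (that file name is my assumption, not something from the post):

```python
import pandas as pd

train = pd.read_csv("train.csv")  # hypothetical local copy of the Kaggle training data

# Positive skewness means a long right tail; SalePrice comes out around 1.9
print(train["SalePrice"].skew())

# The mean sits well above the median, the textbook sign of right skew
print(train["SalePrice"].mean(), train["SalePrice"].median())
```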

In log scale, errors in predicting expensive houses and errors in predicting cheap houses register as comparable errors. With the untransformed data, a 5% error on a $10,000 house (shack?) and a 5% error on a $1,000,000 house have very different dollar effects; with log-transformed data, 5% is 5%. Thus, some insist that normality itself isn't the goal of log transformation, but this equal treatment of relative errors. Regardless, everyone's for it.
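
To put a number on it: a 5% miss adds the same constant on the log scale no matter the price, since

$$\log(1.05\,x) - \log(x) = \log(1.05) \approx 0.049,$$

whether $x$ is $10,000 or $1,000,000.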

Once log-transformed, the distribution of our known target values resembles a normal distribution. When you get to predicting values, a simple exponentiation gets you back into your original units.
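
In code, that round trip is a pair of one-liners. A minimal sketch, again assuming the training data lives in a local train.csv (hypothetical file name) with the Kaggle SalePrice column:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")  # hypothetical local copy of the Kaggle training data

# Fit on the log of the target; log1p = log(1 + x) stays safe if a value happens to be 0
y_log = np.log1p(train["SalePrice"])

# Whatever the model predicts in log units, expm1 converts back to dollars
preds_log = y_log.head()             # stand-in for model output
preds_dollars = np.expm1(preds_log)  # recovers the original SalePrice values
```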

Log transformation is a great go-to for positive skew caused by a few large values. There are other transformations you can use if log doesn't get you there. See the arcsine transformation for data representing proportions or percentages, and the Box-Cox transformation (a family of power transformations that includes the log as the special case λ = 0) when you'd rather let the data pick the power for you.
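
For instance, scipy will estimate the Box-Cox power by maximum likelihood; this toy example (made-up numbers, just to show the mechanics) is one way to do it:

```python
import numpy as np
from scipy.stats import boxcox

# Toy right-skewed, strictly positive data (Box-Cox requires positive values)
x = np.array([1_000.0, 2_000.0, 3_500.0, 10_000.0, 250_000.0])

# With the default lmbda=None, boxcox picks the power by maximum likelihood;
# a fitted lambda near 0 means a plain log transform would have done roughly the same job
x_transformed, fitted_lambda = boxcox(x)
print(fitted_lambda)
```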

1876 caricature of John Maddison Morton, author of a comedy called ‘Box and Cox’, which has absolutely nothing to do with statisticians George Box and Sir David Roxbee Cox. source: wikipedia

2. Transform other skewed numeric variables

So we agree that transforming your dependent variable can help eliminate skew and its adverse impacts on regression. But what about independent features? Turns out you can transform until the cows come home, and you can even use different transformations for different features. Not recommended, but you can; just keep straight all your tinkering so you can interpret things later. I followed Papiu’s and Solal’s lead to log transform skewed numerical features, anything with |skew| > 0.5.
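
Here is roughly what that thresholding looks like, assuming the training data has been loaded as train (hypothetical name) and using the |skew| > 0.5 cutoff from above:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")  # hypothetical local copy of the Kaggle training data

# Measure skewness of the numeric features only, leaving the target out of it
numeric_cols = train.select_dtypes(include=[np.number]).columns
skews = train[numeric_cols].skew().drop("SalePrice")

# Log-transform anything past the |skew| > 0.5 threshold
skewed_cols = skews[skews.abs() > 0.5].index
train[skewed_cols] = np.log1p(train[skewed_cols])
```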

There are few hard and fast rules about what can be transformed, and the statistics community could argue for hours about what should be transformed. General wisdom says transforming categoricals gets weird — and intuitively, it is awkward to rescale and transform something that’s been binarized. But sometimes I like to throw general wisdom to the wind: you can’t count your chickens before making an omelet. I transformed all skewed variables — continuous, discrete, and dummies — just to test the waters. And you know what?

Don’t do it. Didn’t help, kinda hurt. Generations of statisticians are on to something.

Log-transforming the continuous numeric variables that met the skew threshold above greatly improved my RMSE. But be aware that the resulting coefficients will need some rephrasing when you interpret them.
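
As a rule of thumb for that rephrasing (standard for log models, not specific to this dataset): with a logged target and an untransformed feature, a one-unit increase in the feature multiplies the predicted price by a factor of

$$e^{\beta} \approx 1 + \beta \quad (\text{for small } \beta),$$

i.e. roughly a 100(e^β − 1)% change in price. When both the target and the feature are logged, β reads as an elasticity: a 1% change in the feature goes with about a β% change in predicted price.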

3. Regularized Regression

Regularization is an addition to regression that shrinks coefficients toward zero. Is your model showing high variance? Regularize. It was a new concept to me at the time of this Kaggle foray, and it's now a favorite. Who needs loud, gaudy coefficients striving to capture the nuances of each and every data point? Pipe down, coefficients, minimalism is in right now. Constraining coefficients reduces overfitting, so your model will be more generalizable to unseen data.

The two types of regularization I'll discuss here are ridge regression and LASSO regression, which shrink coefficients by adding a penalty term to the loss, weighted by λ.

Recall that ordinary least squares (OLS) is a statistical procedure that finds the best fit for the data by minimizing the mean squared error (MSE).
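
In the usual notation, with $n$ observations, observed values $y_i$, and model predictions $\hat{y}_i$, that looks like:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2$$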

Ridge regression adds a penalty to this equation, and the tuning parameter lambda (λ) tells us how much weight that penalty has.
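
With $p$ features and coefficients $\beta_j$, the ridge objective is commonly written as the MSE plus an L2 penalty (the exact scaling of the penalty term varies between textbooks and libraries):

$$\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 \;+\; \lambda\sum_{j=1}^{p}\beta_j^{2}$$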

If λ = 0, we're working with regular ol' MSE. As λ increases, the penalty term becomes more severe. Since we're trying to minimize the whole expression, as λ grows the coefficients β must shrink to compensate. Because the penalty is built from squared coefficients, this is known as 'L2' regularization.

LASSO regression is similar, but instead of penalizing the squared magnitude of each coefficient, it penalizes the absolute value of each coefficient, known as the 'L1' penalty:
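
$$\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2 \;+\; \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

(Same notation as the ridge objective above; only the penalty term changes.)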

LASSO is actually an acronym; I haven't just been yelling. It stands for Least Absolute Shrinkage and Selection Operator. Increasing λ shrinks feature coefficients toward 0, and a high enough λ actually zeros out the small ones (the selection part). There are much better resources for mathier explanations (such as this write-up by Prashant Gupta), but long story short, LASSO regression produces actual 0s for less important feature coefficients, while ridge coefficients only approach 0. For this regression task, LASSO worked best: less impactful features? Toss 'em. With 80+ features, keeping everything (even with ridge's reduced coefficients) was too noisy, and the model performed better when the weakest features were zeroed out entirely.
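
For the curious, here is a rough sketch of how that looks with scikit-learn's LassoCV, which picks λ (called alpha there) by cross-validation. It assumes the Kaggle training data sits in a local train.csv and, to keep it short, uses only the numeric columns with median imputation; the notebooks linked above do far more preprocessing:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")  # hypothetical local copy of the Kaggle training data

# Bare-bones feature matrix: numeric columns only, median-imputed
X = train.select_dtypes(include=[np.number]).drop(columns=["SalePrice"])
X = X.fillna(X.median())
y = np.log1p(train["SalePrice"])  # log target, per section 1

# Standardize first: the penalty treats every coefficient on the same scale
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X, y)

coefs = lasso.named_steps["lassocv"].coef_
print(f"{int((coefs == 0).sum())} of {len(coefs)} coefficients zeroed out")
```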

A main lesson I learned through implementing ridge and LASSO here is that both performed best when given all of the features to work with. Unregularized linear and logistic regression can benefit from manual feature selection based on correlation, but these regularization methods did best when handed all the features and then left to their own devices for shrinking and selecting. Math is better than your gut.
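
If you want to check that claim on your own data, a quick cross-validated comparison on the full feature set might look like the sketch below (same bare-bones preprocessing as the LASSO example, and RMSE on the logged target, which mirrors how the competition is scored):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")  # hypothetical local copy of the Kaggle training data
X = train.select_dtypes(include=[np.number]).drop(columns=["SalePrice"])
X = X.fillna(X.median())
y = np.log1p(train["SalePrice"])

# Compare cross-validated RMSE (on the log scale) with every feature kept
models = {
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 25)),
    "lasso": LassoCV(cv=5, random_state=0),
}
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    rmse = -cross_val_score(pipe, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(name, round(rmse.mean(), 4))
```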

Elastic net regularization is another common method, but it was unknown to me at the time of the Ames competition, so it gets no credit here. It blends the LASSO penalty and the ridge penalty according to a mixing ratio, so if neither one is quite right on its own, this could be your sweet spot. Not to be confused with G-spot, a female erogenous zone invented in the 1950s by physician Ernst Grafenberg (finally).

Thank you for your service.
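
If you do want to try elastic net, scikit-learn's ElasticNetCV will search both the mixing ratio and the penalty strength for you. A minimal sketch, meant to be fit on the same X and y built in the LASSO example above:

```python
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# l1_ratio=1.0 is pure LASSO, l1_ratio near 0 is ridge-like;
# ElasticNetCV cross-validates both the ratio and the penalty strength
enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0], cv=5, random_state=0),
)
# enet.fit(X, y)  # X, y as constructed in the LASSO sketch above
```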

So there you have it. The linked notebooks in the intro are much more thorough and provide wonderful coding guidance. But if they leave you asking why, then perhaps this overview can be of assistance. I may be an amateur data scientist, but as of writing this my Kaggle record is 100% W’s. Now that this is posted I can get on with ruining that record.

Statty Spice

"The more you {}, the less you {}".format("know", np.random.choice(["know", "need to say"], p=[.8, .2]))