Data

Datasets used in this course

Inside Airbnb

This is a collection of frequently updated public Airbnb data sets that are well suited to practicing basic data visualization and Exploratory Data Analysis (EDA).

Wikimedia Foundation Product Analytics/Comparison datasets

Data collected by the Wikimedia Foundation’s Product Analytics team on the development of different language versions of Wikipedia, the free encyclopedia.

UCLA Statistical Methods and Data Analysis - LOGIT REGRESSION data set

A classic binary classification problem: predict a binary response variable admit from gre, gpa, and rank.
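To make the task concrete, here is a minimal sketch of a logistic regression of admit on gre, gpa, and rank, fitted by plain gradient descent using only the Python standard library. The rows in ROWS are invented for illustration and are not taken from the UCLA data set; the helper names (standardize, fit, log_loss) and the learning rate and step count are likewise assumptions. For brevity rank is treated as a plain number here, although treating it as a categorical variable is a common alternative.

```python
import math

# Hypothetical toy rows (gre, gpa, rank, admit); invented for illustration,
# NOT taken from the UCLA data set.
ROWS = [
    (400, 2.8, 4, 0),
    (480, 3.0, 3, 0),
    (520, 3.1, 4, 0),
    (560, 3.3, 2, 0),
    (600, 3.2, 3, 1),
    (620, 3.6, 2, 0),
    (660, 3.5, 1, 1),
    (700, 3.7, 2, 1),
    (740, 3.4, 1, 1),
    (780, 3.9, 1, 1),
]


def standardize(rows):
    """Scale each predictor column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, X, y):
    """Average negative log-likelihood of the labels under the model."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(w * v for w, v in zip(weights, xi)))
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(y)


def fit(X, y, lr=0.1, steps=500):
    """Batch gradient descent on the average log-loss, starting from zeros."""
    weights = [0.0] * len(X[0])
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(w * v for w, v in zip(weights, xi))) - yi
            for j, v in enumerate(xi):
                grads[j] += err * v
        weights = [w - lr * g / len(y) for w, g in zip(weights, grads)]
    return weights


predictors = [row[:3] for row in ROWS]
admit = [row[3] for row in ROWS]
# The leading 1.0 in each row is the intercept term.
X = [[1.0] + scaled_row for scaled_row in standardize(predictors)]
weights = fit(X, admit)
# Fitted probability of admission for each toy row.
probs = [sigmoid(sum(w * v for w, v in zip(weights, xi))) for xi in X]
```

In practice one would load the real data set and fit the model with a statistics or machine learning library. The hand-written loop is only meant to show what "predict admit from gre, gpa, and rank" amounts to.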

Household Size in the Philippines case study data set from Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R, Paul Roback and Julie Legler

An excellent data set for practicing Poisson Regression, taken from a classic book on GLMs in R.

Kaggle: The Boston Housing Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. We will use it to practice Random Forest models for regression problems.

Kaggle: AirQualityUCI

Predict air quality from the data recorded by a gas multisensor device deployed in the field.

Kaggle: Fish market

A database of common fish species sold at a fish market: build a predictive model and assess how well the weight of a fish can be predicted.

Multiple Linear Regression: House Sales in King County, USA

Predict the price of a property.

UCI Machine Learning Repository: Wine Quality data set

We use the Wine Quality dataset to train a regularized Multinomial Regression model that predicts the wine quality class.

Kaggle: Bank Customer Churn Prediction

The task is to predict the Exited variable, which makes this a churn prediction problem.

UCI Machine Learning Repository: Online News Popularity Data Set

The task is to predict the web popularity of a post, measured as the number of shares the post receives once it is published.