Here is an overview of (some of) the data sets that we will be using in DATA SCIENCE SESSION VOL. 2 :: Introduction to R for Data Science.


Datasets used in this course

Inside Airbnb

This is a collection of frequently updated public Airbnb data sets that are well suited for practicing basic data visualization and Exploratory Data Analysis (EDA).

Wikimedia Foundation Product Analytics/Comparison datasets

Data collected by Wikimedia Foundation’s Product Analytics team on the development of different language versions of Wikipedia, the free encyclopedia.

UCLA Statistical Methods and Data Analysis - LOGIT REGRESSION data set

A classic binary classification problem: predict a binary response variable admit from gre, gpa, and rank.
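To make the task concrete, here is a minimal sketch of how such a model could be fit with base R's `glm()`. The data are simulated stand-ins, not the UCLA data set: only the column names (`admit`, `gre`, `gpa`, `rank`) follow the description above, and the coefficients used to generate the outcome are arbitrary assumptions.

```r
# Minimal sketch: binomial logistic regression with base R's glm().
# NOTE: the data below are simulated, not the UCLA data set; only the
# column names (admit, gre, gpa, rank) follow the description above.
set.seed(42)
n <- 400
sim <- data.frame(
  gre  = round(rnorm(n, mean = 580, sd = 115)),
  gpa  = round(runif(n, min = 2.2, max = 4.0), 2),
  rank = factor(sample(1:4, n, replace = TRUE))
)

# Arbitrary (assumed) coefficients, used only to generate a binary outcome
eta <- -4 + 0.002 * sim$gre + 0.8 * sim$gpa - 0.5 * as.integer(sim$rank)
sim$admit <- rbinom(n, size = 1, prob = plogis(eta))

# Fit the model: admit as a function of gre, gpa, and rank (categorical)
fit <- glm(admit ~ gre + gpa + rank, data = sim,
           family = binomial(link = "logit"))

# Coefficients are on the log-odds scale; exponentiate to get odds ratios
odds_ratios <- exp(coef(fit))

# Predicted admission probabilities for the observed rows
p_hat <- predict(fit, type = "response")
```

Treating `rank` as a factor is a modeling choice made for this sketch: it estimates a separate effect for each rank level instead of assuming a linear trend across ranks.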

Household Size in the Philippines case study data set from Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R, Paul Roback and Julie Legler

An excellent data set to practice Poisson Regression from a classic GLM book in R.
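As an illustration of what a Poisson Regression looks like in R, here is a minimal sketch using base R's `glm()`. The data are simulated, and the column names `age` and `members` are hypothetical: they are not taken from the case study's data file, and the generating coefficients are arbitrary assumptions.

```r
# Minimal sketch: Poisson regression for a count response with base R's glm().
# NOTE: simulated data; the column names `age` and `members` are hypothetical
# and are not taken from the case study's data file.
set.seed(42)
n <- 500
sim <- data.frame(age = sample(18:90, n, replace = TRUE))

# Arbitrary (assumed) log-linear relationship, used only to generate counts
sim$members <- rpois(n, lambda = exp(0.8 + 0.01 * sim$age))

# Poisson GLM with the canonical log link
fit <- glm(members ~ age, data = sim, family = poisson(link = "log"))

# exp(coefficient) is the multiplicative change in the expected count
# for a one-unit increase in the predictor
rate_ratios <- exp(coef(fit))

# Rough overdispersion check: a ratio of residual deviance to residual
# degrees of freedom near 1 is consistent with the Poisson assumption
dispersion <- deviance(fit) / df.residual(fit)
```

With real count data the dispersion ratio is often well above 1, in which case a quasi-Poisson or negative binomial model may be a better fit than the plain Poisson model sketched here.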

Kaggle: The Boston Housing Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. We will use it to practice Random Forest models for regression problems.

Kaggle: AirQualityUCI

Predict Air Quality from the data recorded by a gas multisensor device deployed in the field.

Kaggle: Fish market

A database of common fish species sold at a fish market: build a predictive model to estimate the weight of a fish.

UCI Machine Learning repository: Wine Quality data set

The goal of the exercise in which we use the Wine Quality dataset is to train a regularized Multinomial Regression model to predict the wine quality class.

Kaggle: Bank Customer Churn Prediction

The task is to predict the Exited variable, which makes this a churn prediction problem.

UCI Machine Learning Repository: Online News Popularity Data Set

The task is to predict the web popularity of a post: the number of shares a post receives once it is published.


Additional Resources

Rdatasets: Rdatasets is a collection of 1892 datasets which were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.


License

Everything is GPL v2.0. Some code might be GPLv3.


Goran S. Milovanović, Chief Scientist & Owner, DataKolektiv, Lead Data Scientist, Smartocto
Contact: goran.milovanovic@datakolektiv.com. This is free software: all content is GPL v2.0 licensed.