DATA SCIENCE SESSIONS VOL. 1 :: 2021/22 :: Introduction to Data Science in R

September 29, 2021 Events, R, Training

Hey: we are sharing some twenty open, free R notebooks to learn Data Science in R here!

Our big educational gig is back: DATA SCIENCE SESSIONS VOL. 1 :: 2021/22, a 24-week, intensive, introductory Data Science course in R, will begin in October 2021.

The course material is GPLv2 licensed, which means that you can use it for free. DataKolektiv charges only for the time we spend working directly with our students.

At DataKolektiv we do not offer any self-paced courses. In this course, you will be working directly with Goran S. Milovanović, PhD, owner of DataKolektiv, an expert Data Scientist and full-stack R developer with more than 20 years of experience in Data Science and Analytics, who provides analytics services for some of the most complex, large datasets in the world. We will begin in October 2021, so make sure to get in touch and enroll.

Figure 1: Start simple – the Sampling Distribution of the Mean.

And no, at DataKolektiv we do not believe that one can learn Data Science by investing 2 to 3 hours of self-paced work weekly. We are very sorry to disappoint many, but that is simply not possible. The weekly workload here is: 3h of tuition, at least 1h of labs, and a minimum of 4h of individual work, plus monthly 1:1 sessions with the lecturer upon request.

Figure 2: Likelihood Function – a fundamental concept in Estimation Theory.

The course encompasses 24 sessions, organized to take you from an elementary introduction to R, RStudio, basic mathematical statistics, and data wrangling, through data visualization, working with databases from R, advanced concepts in estimation theory, Linear and Generalized Linear Models, Decision Trees, Random Forests, and Model Selection techniques, to reporting in R Markdown and interactive visualizations. In other words, this course is a thorough, intensive introduction to Data Science in R, designed to meet the needs of a student who is determined to enter the field, providing rock-solid foundations to support future individual development. Several labs are there to support your learning, and more will be added by the end of the year.

Figure 3: Understanding Entropy.

We especially encourage students of a non-technical, non-STEM background to apply. Your lecturer has studied mathematics, philosophy, and psychology, holds a PhD in psychology, and has years of experience as a full-stack R developer, delivering production-grade software engineering in R: from back-end interactions with Big Data systems (Spark, Hadoop), through advanced Machine Learning, to front-end development. So, it takes a lot of work to learn Data Science, but it is certainly doable. A non-STEM background will do just fine, but investing 2 to 3 hours weekly in a self-paced course alone will probably not.

The course is focused on Supervised Learning in Prediction and Classification problems.

Figure 4: Understanding Model Selection: an ROC analysis.

Here is an overview of the DATA SCIENCE SESSIONS VOL. 1 :: 2021/22 course, with links to the R Markdown notebook for each session:

  • Session 00. Installations, organization & intro readings [Notebook]
  • Session 01. Feeling R: Basics + Intuitive Understanding of R [Notebook]
  • Session 02. Installing R packages, I/O + a deep dive into the data.frame class [Notebook]
  • Session 03. Control flow + Functional programming in R [Notebook]
  • Session 04. Functions and vectorization. Overview: R programming. Some non-tabular data representations: XML and JSON [Notebook]
  • Session 05. Vector and matrix arithmetic. Strings and text: {stringr} [Notebook]
  • Session 06. Exploratory Data Analysis (EDA) + {ggplot2} [Notebook]
  • Session 07. Introduction to Probability Theory in R. Random Variables + Probability Functions [Notebook]
  • Session 08. More Probability Theory in R + Serious data wrangling with {dplyr} and {tidyr} [Notebook]
  • Session 09. The Relational Data Model w. {dplyr} + Statistical Hypothesis testing from the χ² Distribution [Notebook]
  • Session 10. Working with a local RDBMS from {dplyr} and {DBI} + t-test for unpaired samples [Notebook]
  • Session 11. Mastering {data.table}: efficient operations on large datasets. Probability: Conditional Probability. Bayes’ Theorem [Notebook]
  • Session 12. Introduction to Estimation Theory: understanding the logic of statistical modeling. Introduction to covariance, correlation, and Simple Linear Regression [Notebook]
  • Session 13. Introduction to Estimation Theory. Partial and part correlation. The logic of statistical modeling elaborated. Enter the Sim-Fit loop. The bias of a statistical estimate. Parametric bootstrap [Notebook]
  • Session 14. Introduction to Estimation Theory. Multiple Linear Regression. Model diagnostics. The role of part correlation in this model. Dummy coding of categorical variables in R [Notebook]
  • Session 15. The logic of statistical modeling explained: optimize the Simple Linear Regression model from scratch. Understanding why statistics = learning: the concept of error minimization [Notebook]
  • Session 16. Generalized Linear Models I. Binary classification problems: enter Binomial Logistic Regression. Probability Theory: a Maximum Likelihood Estimate (MLE) [Notebook]
  • Session 17. Generalized Linear Models II. Multinomial Logistic Regression for classification problems. ROC analysis for classification problems. Maximum Likelihood Estimation (MLE) revisited [Notebook]
  • Session 18. Generalized Linear Models III. Poisson regression. Negative binomial regression. Cross-validation in Regression problems [Notebook]
  • Session 19. Cross-validation in classification problems. An introduction to Decision Trees: complicated classification problems and powerful solutions. Post-pruning of a Decision Tree model [Notebook]
  • Session 20. Classification and Regression Trees (CART) w. {rpart}. Elements of Information Theory for Classification Trees. Pre-pruning and post-pruning (revisited) of Decision Trees [Notebook]
  • Session 21. Random Forests and Cross-Validation (revisited) [Notebook]
  • Session 22. Regularization (L1 and L2) in regression problems. Running R code in parallel w. {snowfall} for efficient data processing and modeling [Notebook]
  • Session 23. Exit workshop 1: wrapping it up + reporting on your Data Science/Analytics project w. R Markdown + interactive visualizations with {plotly} and {leaflet}
  • Session 24. Exit workshop 2: wrapping it up + reporting on your Data Science/Analytics project w. R Markdown + interactive visualizations with {plotly} and {leaflet}

Join me to learn R together!

Goran S. Milovanović, PhD
DataKolektiv 2021.

Goran S. Milovanović, PhD, Chief Scientist & Owner of DataKolektiv. Data Scientist for Wikidata since 2017.