DATA SCIENCE SESSIONS VOL. 1 :: 2021/22 :: Introduction to Data Science in R

September 29, 2021 Events, R, Training

Hey: we are sharing some twenty open, free R notebooks to learn Data Science in R here!

Our big educational gig is back: DATA SCIENCE SESSIONS VOL. 1 :: 2021/22, a 24-week, intensive, introductory Data Science course in R, will begin in October 2021.

The course material is GPLv2 licensed, which means that you can use it for free. DataKolektiv charges only for our time in direct work with our students.

At DataKolektiv we do not offer any self-paced courses. In this course, you will be working directly with Goran S. Milovanović, PhD, owner of DataKolektiv, an expert Data Scientist and full-stack R developer with more than 20 years of experience in Data Science and Analytics, who provides analytics services for some of the most complex, big datasets in the world. We will begin in October 2021, so make sure to get in touch and enroll.

Figure 1: Start simple – the Sampling Distribution of the Mean.
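
To get a feel for what Figure 1 is about, here is a minimal sketch in base R. It is not taken from the course notebooks: the population (exponential with mean 1), the sample size, the number of replications, and all variable names are assumptions picked for illustration.

```r
# Sampling distribution of the mean: a simulation sketch (illustrative assumptions).
set.seed(42)

n_samples   <- 5000   # how many samples we draw (assumed)
sample_size <- 30     # observations per sample (assumed)

# Draw many samples from a skewed population and keep the mean of each one.
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))

# The means should centre on the population mean (1 for rate = 1) ...
mean_of_means <- mean(sample_means)

# ... and their spread should be close to sigma / sqrt(n), the standard error.
se_empirical <- sd(sample_means)
se_theory    <- 1 / sqrt(sample_size)

# Although the population is skewed, the histogram of the means looks roughly bell-shaped.
hist(sample_means, breaks = 50,
     main = "Sampling distribution of the mean",
     xlab = "Sample mean")
```

Comparing `se_empirical` with `se_theory` shows how closely the simulation follows the textbook result.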

And no, at DataKolektiv we do not believe that one can learn Data Science by investing 2–3 hours of self-paced work weekly. We are very sorry to disappoint many, but that is simply not possible. The weekly workload here is: 3h of tuition, at least 1h of labs, and a minimum of 4h of individual work, plus monthly 1:1 sessions with the lecturer upon request.

Figure 2: Likelihood Function – a fundamental concept in Estimation Theory.

The course encompasses 24 sessions, organized to take you from an elementary introduction to R, RStudio, basic mathematical statistics, and data wrangling, through data visualization, working with databases from R, advanced concepts in estimation theory, Linear and Generalized Linear Models, Decision Trees, Random Forests, and Model Selection techniques, to reporting in R Markdown and interactive visualizations. In other words, this course is a thorough, intensive introduction to Data Science in R, designed to meet the needs of a student who is determined to enter the field, providing rock-solid foundations to support future individual development. Several labs are there to support your learning, and more will be added by the end of the year.

Figure 3: Understanding Entropy.

We especially encourage students of a non-technical, non-STEM background to apply. Your lecturer has studied mathematics, philosophy, and psychology, holds a PhD in psychology, and has years of experience as a full-stack R developer, providing software engineering in R from back-end interactions with Big Data systems (Spark, Hadoop), through advanced Machine Learning, to front-end development in R in production-grade code. So, it takes a lot of work to learn Data Science, but it is certainly doable. A non-STEM background will do just fine, but investing 2–3 hours weekly in a self-paced course alone will probably not.

The course is focused on Supervised Learning in Prediction and Classification problems.

Figure 4: Understanding Model Selection: an ROC analysis.

Here is an overview of the DATA SCIENCE SESSIONS VOL. 1 :: 2021/22 course with links to the R Markdown notebook for each session:

  • Session 00. Installations, organization & intro readings [Notebook]
  • Session 01. Feeling R: Basics + Intuitive Understanding of R [Notebook]
  • Session 02. Installing R packages, I/O + a deep dive into the data.frame class [Notebook]
  • Session 03. Control flow + Functional programming in R [Notebook]
  • Session 04. Functions and vectorization. Overview: R programming. Some non-tabular data representations: XML and JSON [Notebook]
  • Session 05. Vector and matrix arithmetic. Strings and text: {stringr} [Notebook]
  • Session 06. Exploratory Data Analysis (EDA) + {ggplot2} [Notebook]
  • Session 07. Introduction to Probability Theory in R. Random Variables + Probability Functions [Notebook]
  • Session 08. More Probability Theory in R + Serious data wrangling with {dplyr} and {tidyr} [Notebook]
  • Session 09. The Relational Data Model w. {dplyr} + Statistical Hypothesis testing from the χ² Distribution [Notebook]
  • Session 10. Working with a local RDBMS from {dplyr} and {DBI} + t-test for unpaired samples [Notebook]
  • Session 11. Mastering {data.table}: efficient operations on large datasets. Probability: Conditional Probability. Bayes’ Theorem [Notebook]
  • Session 12. Introduction to Estimation Theory: understanding the logic of statistical modeling. Introduction to covariance, correlation, and Simple Linear Regression [Notebook]
  • Session 13. Introduction to Estimation Theory. Partial and part correlation. The logic of statistical modeling elaborated. Enter the Sim-Fit loop. The bias of a statistical estimate. Parametric bootstrap [Notebook]
  • Session 14. Introduction to Estimation Theory. Multiple Linear Regression. Model diagnostics. The role of part correlation in this model. Dummy coding of categorical variables in R [Notebook]
  • Session 15. The logic of statistical modeling explained: optimize the Simple Linear Regression model from scratch. Understanding why statistics = learning: the concept of error minimization [Notebook]
  • Session 16. Generalized Linear Models I. Binary classification problems: enter Binomial Logistic Regression. Probability Theory: a Maximum Likelihood Estimate (MLE) [Notebook]
  • Session 17. Generalized Linear Models II. Multinomial Logistic Regression for classification problems. ROC analysis for classification problems. Maximum Likelihood Estimation (MLE) revisited [Notebook]
  • Session 18. Generalized Linear Models III. Poisson regression. Negative binomial regression. Cross-validation in Regression problems [Notebook]
  • Session 19. Cross-validation in classification problems. An introduction to Decision Trees: complicated classification problems and powerful solutions. Post-pruning of a Decision Tree model [Notebook]
  • Session 20. Classification and Regression Trees (CART) w. {rpart}. Elements of Information Theory for Classification Trees. Pre-pruning and post-pruning (revisited) of Decision Trees [Notebook]
  • Session 21. Random Forests and Cross-Validation (revisited) [Notebook]
  • Session 22. Regularization (L1 and L2) in regression problems. Running R code in parallel w. {snowfall} for efficient data processing and modeling [Notebook]
  • Session 23. Exit workshop 1: wrapping it up + reporting on your Data Science/Analytics project w. R markdown + interactive visualizations with {plotly} and {leaflet}
  • Session 24. Exit workshop 2: wrapping it up + reporting on your Data Science/Analytics project w. R markdown + interactive visualizations with {plotly} and {leaflet}

Join me to learn R together!

Goran S. Milovanović, PhD
DataKolektiv 2021.


DataKolektiv is online!

June 17, 2020 Events, General

DataKolektiv finally has its website up and running. We have just enough time to say Hello World and disappear, because we are in the middle of preparations for our Semantic Web in R for Data Scientists workshop at the forthcoming e-Rum 2020 conference:

This workshop will offer a hands-on approach to Semantic Web technologies in R by exemplifying how to work with Wikidata and DBpedia in different ways. Attendees of the workshop should be R developers who understand the typical ways of dealing with familiar data structures like dataframes and lists. The workshop will be supported by well-documented, readable code in a dedicated GitHub repo.

The plan is to start simple (using the Wikidata API, for example) and then slowly progress towards more advanced topics (e.g. your first SPARQL query from R and why it is not as complicated as people think, matching your data set against Wikidata entities in order to enrich it, and similar). I will provide an introduction to the Semantic Web on a conceptual level only, so that participants will not need a full understanding of the related technical standards (RDF, different serializations, etc.) to follow through.
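
As a rough idea of how little is needed for a first SPARQL query from R, the sketch below only builds the request URL for the public Wikidata Query Service in base R. It is not the workshop's code: the query (five items that are an instance of house cat), the `format=json` parameter, and the variable names are illustrative assumptions.

```r
# Build a request URL for the Wikidata Query Service (illustrative sketch, base R only).
endpoint <- "https://query.wikidata.org/sparql"

# Assumed example query: items whose "instance of" (P31) is "house cat" (Q146).
query <- '
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 5'

# Percent-encode the query so it can travel safely as a URL parameter.
encoded_query <- utils::URLencode(query, reserved = TRUE)

# Ask for a JSON response; nothing is sent over the network here.
request_url <- paste0(endpoint, "?query=", encoded_query, "&format=json")
```

Fetching `request_url` with an HTTP client and parsing the JSON response is then a separate step that needs network access.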

Finally, we will show how to process the Wikidata JSON dump from R for those interested in playing big games with R and the Semantic Web. We might be playing around with some interactive graph visualizations during the workshop. I think that the Semantic Web is a new topic for many Data Scientists and that the R world definitely deserves a better introduction to it than it currently has.

The workshop will rely on and extend the approach and material developed by Goran to introduce the Semantic Web and Wikidata in R at his 2019 MilanoR Meetup. The Covid-19 pandemic prevented the organizers of e-Rum 2020 from welcoming us in beautiful Milano, Italy, but they fought like lions in spite of all the obstacles and managed to move the whole event online. Compliments to the organizers; we are looking forward to it!

MilanoR Meetup w. Goran S. Milovanović, 2019/06/25.

Many interesting new developments and training sessions, alongside new collaborations, will be taking place at DataKolektiv during the summer. Stay tuned.