Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.
We have already learned a lot about vectors and matrices in R. However, following the first four (intensive) sessions on R programming, covering everything from vectors and lists (which are also vectors in R) to iterations, decisions, and functions, that knowledge might be a bit scattered. Now we want to consolidate our knowledge of vectors and then introduce multidimensional arrays and some basic linear algebra. After all, understanding how vectors behave in a vectorized programming language is pretty much a prerequisite for being in command of it. Following our overview of vectors, matrices, and arrays, we proceed to the super-important topic of strings and text processing in R. We introduce the {stringr} package and discuss the basics of regular expressions (regex). While regular expressions are a topic that deserves a course of their own, the basics are definitely an essential part of any Data Science and Analytics role.
Install the following packages:
install.packages('stringr')
Note. By now, many of you have probably already installed {tidyverse}. If that is the case, library(tidyverse)
would do just fine - {stringr} is there.
A reminder. First of all: vectorization is always turned on; that is simply the nature of R…
a <- c(7, 1, 3, 9, 15)
b <- 5
a + b
[1] 12 6 8 14 20
… but recycling is also always on: the result we have observed is a consequence of the fact that b, a numeric vector of length one, was recycled as many times as necessary to match the length of a, which is five. See:
a <- 1:10
b <- c(2, 3)
a ^ b
[1] 1 8 9 64 25 216 49 512 81 1000
Square, then cube, then square, then cube… and so on, because we are recycling b <- c(2, 3).
The same for matrices:
a <- matrix(1:9,
ncol = 3)
print(a)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Now
a^2
[,1] [,2] [,3]
[1,] 1 16 49
[2,] 4 25 64
[3,] 9 36 81
But
a^c(2, 3)
longer object length is not a multiple of shorter object length
[,1] [,2] [,3]
[1,] 1 64 49
[2,] 8 25 512
[3,] 9 216 81
Again: how does R order the elements of a matrix? Column by column, i.e. in column-major order - see the sketch below. Mind the warning, by the way.
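A quick sketch to make the ordering explicit; the byrow argument of matrix() is used here only for illustration:
# R fills a matrix column by column by default; byrow = TRUE fills it row by row
matrix(1:6, nrow = 2)               # columns: (1, 2), (3, 4), (5, 6)
matrix(1:6, nrow = 2, byrow = TRUE) # rows: (1, 2, 3), (4, 5, 6)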
The recycling rule:
If two vectors are of unequal length, the shorter one will be recycled in order to match the longer vector.
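To see what recycling amounts to, here is a minimal sketch that recycles the shorter vector "by hand" with rep_len() (used here only for illustration):
a <- 1:10
b <- c(2, 3)
# recycling b manually gives the same result as letting R recycle it:
all(a ^ b == a ^ rep_len(b, length(a))) # TRUE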
Now, on to subsetting vectors and matrices.
a <- seq(2, 100, 2)
print(a)
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
[24] 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92
[47] 94 96 98 100
We can subset by indices:
a[1:20]
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
But we can also create a mask and subset by it:
a <- seq(2, 100, 2)
print(a)
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
[24] 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92
[47] 94 96 98 100
mask <- rep(c(T, F), times = length(a)/2)
print(mask)
[1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
[16] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
[31] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
[46] FALSE TRUE FALSE TRUE FALSE
length(a) == length(mask)
[1] TRUE
a_mask <- a[mask]
print(a_mask)
[1] 2 6 10 14 18 22 26 30 34 38 42 46 50 54 58 62 66 70 74 78 82 86 90 94 98
length(a_mask)
[1] 25
Reminder. Unidimensional vectors do not have a dimension in R:
print(a)
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
[24] 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92
[47] 94 96 98 100
dim(a)
NULL
They only have a length:
length(a)
[1] 50
Unlike matrices or dataframes:
a <- matrix(1:9,
ncol = 3)
dim(a)
[1] 3 3
Did you ever think about using negative indices?
a <- 1:10
a[-2]
[1] 1 3 4 5 6 7 8 9 10
So, negative indices delete elements from a vector, just as FALSE deletes them when used in a mask! See:
a <- matrix(1:9,
nrow = 3)
print(a)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Now:
a[-2, -2]
[,1] [,2]
[1,] 1 7
[2,] 3 9
What has just happened? Well… [-2, -2]
means: remove the 2nd row and the 2nd column. There are interesting combinations to remember, such as…
a[-2, ]
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 3 6 9
… which reads: remove the second row, but keep all columns. Remember how we used to subset dataframes? Or:
a[-2, 3]
[1] 7 9
^^ removed the 2nd row, and then kept everything from the 3rd column of a
. Mind the classes, it is not a matrix
anymore…
class(a[-2, 3])
[1] "integer"
… so dim(a[-2, 3])
is, of course:
dim(a[-2, 3])
NULL
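If you would rather keep the matrix class when a subset collapses to a single row or column, base R's drop argument does exactly that; a quick sketch, reusing the matrix a from above:
a[-2, 3, drop = FALSE]        # a 2 x 1 matrix, not an integer vector
class(a[-2, 3, drop = FALSE]) # "matrix" "array" (in R >= 4.0)
dim(a[-2, 3, drop = FALSE])   # 2 1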
Let’s begin by creating two vectors, arr1
and arr2
:
arr1 <- seq(2,20,2)
arr2 <- seq(1,19,2)
print("arr1: ")
[1] "arr1: "
print(arr1)
[1] 2 4 6 8 10 12 14 16 18 20
print("arr2: ")
[1] "arr2: "
print(arr2)
[1] 1 3 5 7 9 11 13 15 17 19
Vectorized, element-wise multiplication:
arr1 * arr2
[1] 2 12 30 56 90 132 182 240 306 380
Now, introduce the scalar product (“dot product”, or “inner product”: the sum of the products of the corresponding entries of the two sequences of numbers) in R with %*%
:
arr1 %*% arr2
[,1]
[1,] 1430
which is, of course, the same as:
sum(arr1 * arr2)
[1] 1430
Now we introduce the transpose, t()
. It is more intuitive to begin with a matrix:
mat <- matrix(1:9,
ncol = 3)
print(mat)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
And t(mat)
is:
t(mat)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
It is easy to understand: the rows become columns, and the columns become rows. But what happens if we transpose a unidimensional array of numbers?
print(arr1)
[1] 2 4 6 8 10 12 14 16 18 20
t(arr1)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 4 6 8 10 12 14 16 18 20
No difference? Well, there is one. R defaults to column vectors; only the second example (i.e. t(arr1)) is a row vector - see the quick check below.
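One way to convince yourself: a plain vector has no dim attribute, while its transpose is a genuine 1 x 10 matrix.
dim(arr1)       # NULL - a "plain" vector
dim(t(arr1))    # 1 10 - a 1 x 10 (row-vector) matrix
dim(t(t(arr1))) # 10 1 - a 10 x 1 (column-vector) matrix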
Dot product, again:
# - arr1 will become a row vector after t();
# - arr2 will remain a column vector:
t(arr1) %*% arr2
[,1]
[1,] 1430
But:
# - arr1 will be a column vector;
# - arr2 will become a row vector after t():
arr1 %*% t(arr2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 6 10 14 18 22 26 30 34 38
[2,] 4 12 20 28 36 44 52 60 68 76
[3,] 6 18 30 42 54 66 78 90 102 114
[4,] 8 24 40 56 72 88 104 120 136 152
[5,] 10 30 50 70 90 110 130 150 170 190
[6,] 12 36 60 84 108 132 156 180 204 228
[7,] 14 42 70 98 126 154 182 210 238 266
[8,] 16 48 80 112 144 176 208 240 272 304
[9,] 18 54 90 126 162 198 234 270 306 342
[10,] 20 60 100 140 180 220 260 300 340 380
A faster way to obtain a dot product of two vectors is to use crossprod()
:
crossprod(arr1,arr2)
[,1]
[1,] 1430
But the class of crossprod(arr1,arr2)
will be:
class(crossprod(arr1,arr2))
[1] "matrix" "array"
drop()
can be used to strip the matrix
and array
classes and obtain a scalar value as a result:
# as scalar:
drop(crossprod(arr1, arr2))
[1] 1430
Also, a more efficient way to obtain arr1 %*% t(arr2) is to use tcrossprod():
tcrossprod(arr1, arr2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 6 10 14 18 22 26 30 34 38
[2,] 4 12 20 28 36 44 52 60 68 76
[3,] 6 18 30 42 54 66 78 90 102 114
[4,] 8 24 40 56 72 88 104 120 136 152
[5,] 10 30 50 70 90 110 130 150 170 190
[6,] 12 36 60 84 108 132 156 180 204 228
[7,] 14 42 70 98 126 154 182 210 238 266
[8,] 16 48 80 112 144 176 208 240 272 304
[9,] 18 54 90 126 162 198 234 270 306 342
[10,] 20 60 100 140 180 220 260 300 340 380
in place of the slower version we have already seen:
arr1 %*% t(arr2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2 6 10 14 18 22 26 30 34 38
[2,] 4 12 20 28 36 44 52 60 68 76
[3,] 6 18 30 42 54 66 78 90 102 114
[4,] 8 24 40 56 72 88 104 120 136 152
[5,] 10 30 50 70 90 110 130 150 170 190
[6,] 12 36 60 84 108 132 156 180 204 228
[7,] 14 42 70 98 126 154 182 210 238 266
[8,] 16 48 80 112 144 176 208 240 272 304
[9,] 18 54 90 126 162 198 234 270 306 342
[10,] 20 60 100 140 180 220 260 300 340 380
Note. From the
crossprod()
documentation: Vectors are promoted to single-column or single-row matrices, depending on the context.
Basic matrix algebra:
mat1 <- matrix(1:9,
nrow = 3)
mat1
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
mat2 <- matrix(seq(2, 18, 2),
nrow = 3)
mat2
[,1] [,2] [,3]
[1,] 2 8 14
[2,] 4 10 16
[3,] 6 12 18
Vectorized matrix multiplication is, again, element-wise in R:
mat1 * mat2
[,1] [,2] [,3]
[1,] 2 32 98
[2,] 8 50 128
[3,] 18 72 162
Real algebraic matrix multiplication is obtained by %*%
:
mat1 %*% mat2
[,1] [,2] [,3]
[1,] 60 132 204
[2,] 72 162 252
[3,] 84 192 300
And then, what is often used in statistics, X'X
, is of course:
crossprod(mat1, mat2)
[,1] [,2] [,3]
[1,] 28 64 100
[2,] 64 154 244
[3,] 100 244 388
which is the same as (less efficient):
t(mat1) %*% mat2
[,1] [,2] [,3]
[1,] 28 64 100
[2,] 64 154 244
[3,] 100 244 388
While XX'
is:
tcrossprod(mat1, mat2)
[,1] [,2] [,3]
[1,] 132 156 180
[2,] 156 186 216
[3,] 180 216 252
the same as (less efficient):
mat1 %*% t(mat2)
[,1] [,2] [,3]
[1,] 132 156 180
[2,] 156 186 216
[3,] 180 216 252
Multidimensional arrays in R are created by array():
input <- c(5, 9, 3, 10, 11, 12, 13, 14, 15)
length(input)
[1] 9
arr1 <- array(input,
dim = c(3, 3, 2))
print(arr1)
, , 1
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
, , 2
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
arr1[, , 1]
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
arr1[, , 2]
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
Let’s check something:
prod(c(3, 3, 2)) == length(input)
[1] FALSE
So arr1 was produced by recycling the input vector - that is why arr1[ , , 1] and arr1[ , , 2] are identical():
identical(arr1[ , , 1], arr1[ , , 2])
[1] TRUE
Everything else works as expected:
apply(arr1, 1, sum)
[1] 56 68 60
apply(arr1, 2, sum)
[1] 34 66 84
apply(arr1, 3, sum)
[1] 92 92
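apply() also accepts more than one margin; for instance, summing over the third dimension collapses the two (identical) layers into a single matrix. A small sketch:
# sum across the 3rd dimension, i.e. over the two layers:
apply(arr1, c(1, 2), sum)
# each cell equals 2 * arr1[i, j, 1], e.g. the [1, 1] cell is 2 * 5 = 10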
We will now provide a very short and concise overview of some of R's functionality for string processing. The latter is among the most interesting and difficult topics in computer science. On the other hand, the work of a contemporary Data Scientist - a practitioner who needs to invest time and resources to get their data sets cleaned and properly formatted for mathematical modeling - is heavily loaded with text and string processing steps. Many data sources that are available out there provide only unstructured or semi-structured data, and that is where the skills of string handling, text processing, and, finally, data wrangling (next session) come into play. The caveat here is that string processing is a huge domain in itself, which is why we can offer only an overview and an introduction here. It is one of those things where a disciple becomes an expert by necessity, and where progress really means practice.
To go beyond this session: Gaston Sanchez’s “Handling and Processing Strings in R” is probably the best that is out there.
library(stringr)
On {stringr}, from Introduction to stringr, 2016-08-19: “Simplifies string operations by eliminating options that you don’t need 95% of the time (the other 5% of the time you can use functions from base R or stringi)” - and it really does.
Now, kick it! Strings in R are character vectors:
string_1 <- "Hello world"
string_2 <- "Sun shines!"
string_1
[1] "Hello world"
string_2
[1] "Sun shines!"
is.character(string_1) # TRUE
[1] TRUE
as.character(200*5)
[1] "1000"
as.numeric("1000")
[1] 1000
as.double("3.14")
[1] 3.14
Remember the character data type? Strings in R are nothing but instantiations of this data type. A character is a very “old” data type in R, so integers and doubles coerce to character whenever appropriate. For example,
number <- 10
paste("Text", number)
[1] "Text 10"
We will discuss paste() later, but you can see from the example that it “puts things together into a character vector” (technically, it concatenates strings). However, the numeric 10 is now lost inside a new string, isn’t it… in R coercion, character eats everything.
One needs to be careful when quoting string constants here (i.e. knowing when the use of ' and " is appropriate):
# Using " and '
# either:
string_1 <- "Hello 'World'"
string_1
[1] "Hello 'World'"
# or
string_1 <- 'Hello "World"'
string_1 # prints: "Hello \"World\"" - what is this: \ ?
[1] "Hello \"World\""
What is this: \ ?!! It was not in my string? Don’t worry, \ is R’s escape character. In the character vector above - 'Hello "World"' - we find two instances of " enclosed by '. On output, however, R always delimits strings by ", which makes four instances of " altogether. The escape character \ signals that the second instance of " is not the end of the string, and that the third instance of " is not the beginning of a new one: both are just tokens to be printed to the output device.
If you care about this much, take a look at the difference between writeLines()
and print()
:
# try:
writeLines(string_1)
Hello "World"
print(string_1)
[1] "Hello \"World\""
You could also start experimenting with cat() - see the sketch after the next block. More on escapism in R:
# Escaping in R: use \, the R escape character
string_1 <- 'Hello \"World\"'
string_1
[1] "Hello \"World\""
writeLines(string_1)
Hello "World"
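And here is cat() at work - like writeLines(), it outputs the string without the surrounding quotes and without the escape characters:
cat(string_1)
# Hello "World"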
Escaping the escape character:
writeLines("\\") # nice
\
Yes, that’s how you get to use the escape character as a printable character in R, if you were wondering. Wait until we get to regular expressions, where things in R really tend to get nasty.
To obtain the length of a string in R…
# Length of strings
length(string_1) # of course
[1] 1
But of course it is: length() counts the elements of a character vector, not the characters in a string. nchar() will do better:
nchar(string_1) # base function
[1] 13
Concatenating strings in R:
string_3 <- c(string_1, string_2) # a character vector of length == 2
writeLines(string_3)
Hello "World"
Sun shines!
No. No, no, no… that’s a character vector of length == 2, we need to use paste()
here:
string_3 <- paste(string_1, string_2, sep = ", ") # length == 1, base function
writeLines(string_3)
Hello "World", Sun shines!
Where {base} has paste()
, {stringr} has str_c()
:
strD <- c("First", "Second", "Third")
# both paste {base} and str_c {stringr} are vectorized
paste("Prefix-", strD, sep = "-") # - base R
[1] "Prefix--First" "Prefix--Second" "Prefix--Third"
str_c("Prefix-", strD, sep = "-") # {stringr}
[1] "Prefix--First" "Prefix--Second" "Prefix--Third"
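Both functions also have a collapse argument that glues all the elements of a character vector into a single string; a quick sketch:
paste(strD, collapse = " + ")  # "First + Second + Third"
str_c(strD, collapse = " + ")  # the same, with {stringr}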
How to split strings into subcomponents? In {base} it’s done by strsplit(), while {stringr} has str_split():
# Splitting strings in R
# with strsplit {base}
string_1 <- "The quick brown fox jumps over the lazy dog"
string_1
[1] "The quick brown fox jumps over the lazy dog"
Base R:
splitA <- strsplit(string_1, " ") # is.list(splitA) == T
splitA
[[1]]
[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
strsplit()
returns a list; unlist()
it to get to your result:
splitA <- unlist(strsplit(string_1, " "))
splitA
[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
Extracting a part of it by combining strsplit()
and paste()
:
# "The quick brown" from "The quick brown fox jumps over the lazy dog"
splitA <- paste(unlist(strsplit(string_1," "))[1:3], collapse = " ")
splitA
[1] "The quick brown"
string_1
[1] "The quick brown fox jumps over the lazy dog"
There’s a fixed
argument that you need to know about in strsplit()
:
splitA <- strsplit(string_1," ")
splitA
[[1]]
[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
splitA <- strsplit(string_1," ", fixed = T)
# fixed = T says: match the split argument exactly;
# otherwise, split is treated as a regular expression (default: fixed = FALSE)
splitA
[[1]]
[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
The str_split() function in {stringr} has some very useful additional functionality in comparison to {base} strsplit(). For example:
string_11 <- "Above all, don't lie to yourself. The man who lies to himself and listens to his own lie comes to a point that he cannot distinguish the truth within him, or around him, and so loses all respect for himself and for others. And having no respect he ceases to love."
string_11
[1] "Above all, don't lie to yourself. The man who lies to himself and listens to his own lie comes to a point that he cannot distinguish the truth within him, or around him, and so loses all respect for himself and for others. And having no respect he ceases to love."
str_split(string_11, boundary("word"))
[[1]]
[1] "Above" "all" "don't" "lie" "to" "yourself"
[7] "The" "man" "who" "lies" "to" "himself"
[13] "and" "listens" "to" "his" "own" "lie"
[19] "comes" "to" "a" "point" "that" "he"
[25] "cannot" "distinguish" "the" "truth" "within" "him"
[31] "or" "around" "him" "and" "so" "loses"
[37] "all" "respect" "for" "himself" "and" "for"
[43] "others" "And" "having" "no" "respect" "he"
[49] "ceases" "to" "love"
# including punctuation and special characters
str_split(string_11, boundary("word", skip_word_none = F))
[[1]]
[1] "Above" " " "all" "," " " "don't"
[7] " " "lie" " " "to" " " "yourself"
[13] "." " " "The" " " "man" " "
[19] "who" " " "lies" " " "to" " "
[25] "himself" " " "and" " " "listens" " "
[31] "to" " " "his" " " "own" " "
[37] "lie" " " "comes" " " "to" " "
[43] "a" " " "point" " " "that" " "
[49] "he" " " "cannot" " " "distinguish" " "
[55] "the" " " "truth" " " "within" " "
[61] "him" "," " " "or" " " "around"
[67] " " "him" "," " " "and" " "
[73] "so" " " "loses" " " "all" " "
[79] "respect" " " "for" " " "himself" " "
[85] "and" " " "for" " " "others" "."
[91] " " "And" " " "having" " " "no"
[97] " " "respect" " " "he" " " "ceases"
[103] " " "to" " " "love" "."
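str_split() also has an n argument that caps the number of pieces; whatever remains after the last split stays whole in the final piece. A small sketch:
# split on a space into at most three pieces
str_split(string_11, " ", n = 3)
# [[1]] has three elements: "Above", "all,", and the rest of the sentence as one piece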
See, I have a character vector, and I need only the first three characters from each component:
# Subsetting strings
string_1 <- c("Data", "Science", "Serbia")
# {base}
substr(string_1, 1, 3)
[1] "Dat" "Sci" "Ser"
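The {stringr} counterpart of substr() is str_sub(); it also accepts negative indices that count from the end of the string. A quick sketch:
str_sub(string_1, 1, 3)   # "Dat" "Sci" "Ser"
str_sub(string_1, -3, -1) # the last three characters: "ata" "nce" "bia"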
Let’s start transforming strings with substr()
:
# {base}
string_2 <- string_1 # just a copy of string_1
substr(string_2, 1, 3) <- "WowWow" # check the result!
string_2
[1] "Wowa" "Wowence" "Wowbia"
substr(string_2, 1, 4) <- "WowWow" # check the result!
string_2
[1] "WowW" "WowWnce" "WowWia"
substr(string_2, 1, 6) <- "WowWow" # check the result!
string_2
[1] "WowW" "WowWowe" "WowWow"
UPPER CASE to lower case with tolower():
string_1 <- "Belgrade"
# {base}
tolower(string_1)
[1] "belgrade"
Now everything to UPPER CASE with {base} toupper()
:
string_1 <- tolower(string_1)
toupper(string_1)
[1] "BELGRADE"
A useful {stringr} function, str_to_title(), capitalizes only the first character of each word:
string_1 <- c("belgrade", "paris", "london", "moscow")
str_to_title(string_1)
[1] "Belgrade" "Paris" "London" "Moscow"
Removing excess whitespace from strings is a notorious operation in text-mining:
# Remove whitespace
string_1 <- c(" Remove whitespace ");
string_1
[1] " Remove whitespace "
Here goes {stringr}’s str_trim() to clean up:
str_trim(string_1) # {stringr}
[1] "Remove whitespace"
There’s a side argument that we use to remove only the leading (side = 'left') or only the trailing (side = 'right') whitespace:
# remove leading whitespace
str_trim(string_1, side = "left")
[1] "Remove whitespace "
# remove trailing whitespace
str_trim(string_1, side = "right")
[1] " Remove whitespace"
Using {base} gsub()
to remove all whitespace:
# remove all whitespace?
string_1 <- c(" Remove whitespace ") # how about this one?
string_1
[1] " Remove whitespace "
# there are different ways to do it. Try:
gsub(" ", "", string_1, fixed = T) # with the default fixed = FALSE, the first (pattern) argument is treated as a regex
[1] "Removewhitespace"
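While we are at it, {stringr} has str_squish(), which trims both ends and collapses any run of interior whitespace into a single space; a quick sketch (the example string is made up for illustration):
str_squish("  Remove   extra   whitespace  ") # "Remove extra whitespace"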
gsub()
is definitely something you need to learn about:
# replacing, in general:
string_1 <- "The quick brown fox jumps over the lazy dog The quick brown"
gsub("The quick brown", "The slow red", string_1, fixed=T)
[1] "The slow red fox jumps over the lazy dog The slow red"
Again, mind the fixed argument - by default, gsub() treats its pattern argument as a regular expression.
string_1
[1] "The quick brown fox jumps over the lazy dog The quick brown"
Does string_1
contain The quick brown
?
# Searching for something in a string {stringr}
str_detect(string_1, "The quick brown") # T or F
[1] TRUE
Where is it? Use str_locate
from {stringr}:
str_locate(string_1, "The quick brown")[[1]] # first match
[1] 1
And what if there is more than one match?
str_locate_all(string_1, "The quick brown")[[1]] # all matches
start end
[1,] 1 15
[2,] 45 59
You might have heard that people in text-mining use term-frequency matrices a lot. These matrices typically list all interesting terms from a set of documents in their rows, while the documents themselves are represented by the columns; the cell entries are counts that tell us how many times a particular term occurs in a particular document.
We will not build a full term-frequency matrix in R now (check the {tm} package for R’s functionality in text-mining), but only demonstrate how to use str_locate_all()
to count the number of occurrences:
# term frequency, as we know, is very important in text-mining:
term1 <- str_locate_all(string_1, "The quick brown")[[1]] # all matches for term1
# ie. "The quick brown"
term1
start end
[1,] 1 15
[2,] 45 59
Hm, it’s easy now:
dim(term1)[1] # how many matches = how many rows in the str_locate_all output matrix
[1] 2
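For counting matches, {stringr} also offers str_count(), which returns the number of matches directly; a quick sketch:
str_count(string_1, "The quick brown")
# [1] 2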
# Sorting character vectors in R {base}
string_1 <- c("New York", "Paris", "London", "Moscow", "Tokyo")
string_1
[1] "New York" "Paris" "London" "Moscow" "Tokyo"
It’s really easy:
sort(string_1)
[1] "London" "Moscow" "New York" "Paris" "Tokyo"
And with decreasing=T
:
sort(string_1, decreasing = T)
[1] "Tokyo" "Paris" "New York" "Moscow" "London"
Once again: Gaston Sanchez’s “Handling and Processing Strings in R” - the chances that you will ever need more than what’s covered in this textbook are slim.
Regular Expressions: go pro. Regular-Expressions.info is a well-known learning resource. To figure out the specific regex flavor used in R, see Regular Expressions as used in R - the section of Regular-Expressions.info on regex in R specifically.
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, by Yihui Xie, J. J. Allaire, and Garrett Grolemund.
A specialized R Markdown Notebook on Regular expressions will be shared soon. The exercises will be found there.
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.