Feedback should be sent to goran.milovanovic@datakolektiv.com. These notebooks accompany the Intro to Data Science: Non-Technical Background course 2020/21.
We will learn more about regular expressions and string processing in R. Please keep in mind that regular expressions are indeed complicated: it would take a course of its own to master them completely. Here we will cover only some elementary applications of regular expressions that are useful in simple data cleaning operations.
Please take this piece of documentation seriously: regex
grepl(), grep(), and regexpr()
Let’s begin by reviewing what we have already learned: grepl()
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
grepl(pattern = "plane", x = strings)
[1] TRUE TRUE FALSE FALSE TRUE
The base R function grepl() asks us to define a pattern - which is a regex - and then to provide a value for the x argument, which is a string (or a vector of strings, as in this example) in which we want to look for the pattern. Our first example is very simple: we ask if "plane" is present in "hyperplane", "airplane", "filter", "dplyr", or "plane", and the result is of course: TRUE, TRUE, FALSE, FALSE, TRUE.
What about the following:
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
grepl(pattern = "Plane", x = strings)
[1] FALSE FALSE FALSE FALSE FALSE
Well, of course: "Plane" is simply not the same as "plane". Now, grepl() has another argument that we did not use before, ignore.case. It is a logical one:
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
grepl(pattern = "Plane", x = strings, ignore.case = TRUE)
[1] TRUE TRUE FALSE FALSE TRUE
Enter regular expressions: remember that ^ represents the beginning of the string?
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
grepl(pattern = "^plane", x = strings, ignore.case = TRUE)
[1] FALSE FALSE FALSE FALSE TRUE
and of course only "plane" in strings begins with "plane".
Which elements of the strings character vector end with plane? We have already seen in this course that, just as ^ matches the empty string at the beginning of the string, there is $, which matches the empty string at the end of the string:
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
grepl(pattern = "plane$", x = strings, ignore.case = TRUE)
[1] TRUE TRUE FALSE FALSE TRUE
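The two anchors can also be combined to ask for an exact match. A minimal sketch, reusing the strings vector from above (the exactMatch name is introduced here for illustration only):

```r
strings <- c("hyperplane", "airplane", "filter", "dplyr", "plane")
# ^ and $ together: the whole string must be exactly "plane"
exactMatch <- grepl(pattern = "^plane$", x = strings)
print(exactMatch)
```

Only the last element should come out as TRUE, since "hyperplane" and "airplane" contain more than just "plane".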
Besides grepl() we also have grep() in base R:
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
grep(pattern = "plane", x = strings)
[1] 1 2 5
Unlike grepl(), which returns a logical vector, grep() returns the indices of the elements of x in which the pattern is found. For example:
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
grep(pattern = "^plane", x = strings)
[1] 5
Or:
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
grep(pattern = "plane$", x = strings)
[1] 1 2 5
But we can also ask for the values from grep():
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
grep(pattern = "plane$", x = strings, value = TRUE)
[1] "hyperplane" "airplane" "plane"
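It may help to see that the three approaches select the same elements. The following sketch compares them; the names fromGrep, fromGrepl, and fromIndex are made up for this illustration:

```r
strings <- c("hyperplane", "airplane", "filter", "dplyr", "plane")
# values returned directly by grep()
fromGrep <- grep(pattern = "plane$", x = strings, value = TRUE)
# the same selection via the logical vector from grepl()
fromGrepl <- strings[grepl(pattern = "plane$", x = strings)]
# and via the integer indices from grep()
fromIndex <- strings[grep(pattern = "plane$", x = strings)]
print(identical(fromGrep, fromGrepl))
print(identical(fromGrep, fromIndex))
```

Which one to use is mostly a matter of what is needed next: a logical vector for filtering, indices for positions, or the values themselves.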
Another important base R function to work with regular expressions is regexpr().
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
regexpr(pattern = "plane", text = strings)
[1] 6 4 -1 -1 1
attr(,"match.length")
[1] 5 5 -1 -1 5
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
Please disregard the useBytes and index.type attributes for now. What is interesting here is the function output and the match.length attribute. The value that the function returns is the position where the pattern begins in each element of text in which it is found, with -1 indicating that the pattern was not found:
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
foundWhere <- regexpr(pattern = "plane", text = strings)
foundWhere[1]
[1] 6
This means that the pattern was found to begin in the 6th position of the first element of text, which is "hyperplane". What was the length of the match?
attr(foundWhere, 'match.length')[1]
[1] 5
Now this you have not seen before. R objects have attributes, and the attributes are accessed via the attr() function. In this case, attr(foundWhere, 'match.length')[1] asks for the first value of the match.length attribute of the foundWhere object.
This is how we find where patterns begin and end in strings in base R, look:
strings = c("hyperplane", "airplane", "filter", "dplyr", "plane")
foundWhere <- regexpr(pattern = "plane", text = strings)
# position where the match begins in the first element
start = foundWhere[1]
# length of the match in the first element
matchLength = attr(foundWhere, 'match.length')[1]
# the last matched character sits at start + matchLength - 1
end = start + matchLength - 1
cat(
paste0('"plane" is found in "',
strings[1],
'" beginning in the ',
start, 'th position and ending in the ',
end, 'th position'))
"plane" is found in "hyperplane" beginning in the 6th position and ending in the 10th position
and it follows that the value of the 'match.length' attribute corresponds to the length of the pattern that we were looking for: "plane" has five characters!
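The start positions and match lengths can be put to work to cut the matched text out of the strings. This is only a sketch under the assumption that we want the matches from those elements where the pattern was found; the names matchLength, found, and extracted are introduced here for illustration:

```r
strings <- c("hyperplane", "airplane", "filter", "dplyr", "plane")
foundWhere <- regexpr(pattern = "plane", text = strings)
matchLength <- attr(foundWhere, "match.length")
# indices of the elements in which the pattern was found
found <- which(foundWhere != -1)
# the last matched character sits at start + length - 1
extracted <- substring(strings[found],
                       first = foundWhere[found],
                       last = foundWhere[found] + matchLength[found] - 1)
print(extracted)
# base R's regmatches() performs the same extraction from the regexpr() output
print(regmatches(strings, foundWhere))
```

Both calls should return "plane" three times, once for each element that contains the pattern.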
Hint: study what gregexpr() does in base R + learn about the difference between print() and cat() in R.
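As a starting point for the hint, here is a small sketch. The "banana" string and the variable names are made up for this illustration:

```r
string <- "banana"
# regexpr() reports only the first match in each string
firstMatch <- regexpr(pattern = "an", text = string)
# gregexpr() reports all matches, as a list with one element per input string
allMatches <- gregexpr(pattern = "an", text = string)
print(as.integer(firstMatch))
print(as.integer(allMatches[[1]]))

x <- "a\tb"
# print() shows the string as R stores it, quoted and with the \t escape visible
print(x)
# cat() writes the raw characters, so the tab is rendered as a tab
cat(x, "\n")
```

"an" begins at positions 2 and 4 of "banana", so regexpr() reports the first of these, while gregexpr() reports both.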
We understand the meaning of ^ and $ in regex already. Now we want to learn about the meaning of +, *, ., [, ], and |.
Imagine that we need to analyze some aspect of a system that uses user names of the following form:
userID <- "Maria0001449"
grepl("0001449", userID)
[1] TRUE
Of course. However, what if the system imposes the following rule: a user name must begin with the user’s real name, followed by one or more digits?
See:
userID <- "Maria0001449"
grepl("[0123456789]+$", userID)
[1] TRUE
What we find in between [ and ] is a character class: it matches any single character found in it. What + means is: the previous character - or a character class, in our example - is found once or more than once. So, the semantics of [0123456789]+$ is exactly the following one:
- any of the characters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 is found,
- (+) once or more than once, and then
- the string ends ($).
What if the system has the following rule for user names: a user name must begin with the user’s real name, followed or not by any number of digits?
userID <- "Maria0001449"
grepl("[0123456789]*$", userID)
[1] TRUE
Ok, but also:
userID <- "Maria"
grepl("[0123456789]*$", userID)
[1] TRUE
Because in regular expressions * means: the previous element repeats zero or more times!
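One consequence is worth a quick check: since zero digits at the end of a string is always possible, [0123456789]*$ on its own is satisfied by any string. The sketch below contrasts it with a stricter pattern; the stricter rule (letters first, then optional digits, nothing else) and the example user names are assumptions made for illustration, and [[:alpha:]] is a predefined class of alphabetic characters:

```r
userIDs <- c("Maria0001449", "Maria", "Maria!!", "0001449")
# "zero or more digits at the end" is satisfied by every string
loose <- grepl("[0123456789]*$", userIDs)
# hypothetical stricter rule: one or more letters, then optional digits, nothing else
strict <- grepl("^[[:alpha:]]+[0123456789]*$", userIDs)
print(loose)
print(strict)
```

The loose pattern accepts all four strings, while the anchored one rejects "Maria!!" and "0001449".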
So, + and * are quantifiers in regular expressions. Again, character classes:
string <- "ABCDE"
grepl("[Y|I|O]", string)
[1] FALSE
What is |? Let’s see:
string <- "ABCDE"
grepl("[Y|A|O]", string)
[1] TRUE
string <- "ABCDE"
grepl("[Y|9|O]", string)
[1] FALSE
string <- "ABCDE"
grepl("[D|E|0]", string)
[1] TRUE
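So | means: logical OR :) A word of caution, though: | acts as OR only outside of a character class. Inside [ and ] it is a literal character, so [Y|A|O] is a class that matches Y, A, O, or | itself, which is why it behaves like OR in these examples; the alternation proper is written "Y|A|O".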
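The difference between | inside and outside of a character class shows up as soon as the string itself contains a | character. A minimal sketch, with a made-up test string and variable names:

```r
string <- "a|b"
# inside a character class, | is just one of the listed characters
inClass <- grepl("[Y|A|O]", string)
# outside of a class, | separates alternatives: Y or A or O
alternation <- grepl("Y|A|O", string)
print(inClass)
print(alternation)
```

The class matches because "a|b" contains a literal |, while the alternation finds none of Y, A, or O.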
The following one, ., is dangerous. The . means: any single character:
string <- "ABCDE"
grepl(".", string)
[1] TRUE
And it is, of course, found everywhere:
gregexpr(".", string)
[[1]]
[1] 1 2 3 4 5
attr(,"match.length")
[1] 1 1 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
So what if we need to recognize . literally in a string? Well, we have to escape it:
string <- "Goran.S.Milovanovic"
gregexpr("\.", string)
Error: '\.' is an unrecognized escape in character string starting ""\."
No, no… This is what we need:
string <- "Goran.S.Milovanovic"
gregexpr("\\.", string)
[[1]]
[1] 6 8
attr(,"match.length")
[1] 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
And yes, . is found in the sixth and eighth position in Goran.S.Milovanovic!
Why the double backslash: \\? Because the backslash is an escape character in R strings, and in regex as well. In the R string "\\." the first \ tells R that the second \ is to be taken literally, not as the start of some special character, so the pattern that reaches the regex engine is \. and there the remaining \ informs the regex engine that what follows it - and that would be . - should be escaped, i.e. not interpreted as any character (regex semantics) but as the . character literally.
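The two levels of escaping can be inspected directly. The sketch below counts the characters that the R string really holds, and also shows the fixed = TRUE argument of gregexpr(), which switches regex interpretation off altogether; the variable names are introduced here for illustration:

```r
pattern <- "\\."
# the R string holds two characters: a backslash and a dot
print(nchar(pattern))
# cat() shows what the regex engine receives
cat(pattern, "\n")
string <- "Goran.S.Milovanovic"
dotPositions <- as.integer(gregexpr(pattern, string)[[1]])
# alternative: with fixed = TRUE the pattern is matched as is, so no escaping is needed
fixedPositions <- as.integer(gregexpr(".", string, fixed = TRUE)[[1]])
print(dotPositions)
print(fixedPositions)
```

Both approaches should report positions 6 and 8. fixed = TRUE is convenient when the pattern is plain text, but it gives up everything else that regex offers.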
Complicated? You will get used to it, do not worry… Remember that Data Science is the sexiest profession of the 21st century. Well is it not? :)
Look:
string <- "Goran.S.Milovanovic0528$@3674"
grepl("[[:alpha:]]+\\.[[:alpha:]]\\.[[:alpha:]]+.+", string)
[1] TRUE
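[[:alpha:]]+\\.[[:alpha:]]\\.[[:alpha:]]+.+ means:
- an alphabetic character ([[:alpha:]]),
- repeated once or more than once (+), is followed by
- a literal dot (\\.), followed by
- a single alphabetic character ([[:alpha:]]), followed by
- a literal dot (\\.), followed by
- an alphabetic character ([[:alpha:]]),
- repeated once or more than once (+), followed by
- any character, once or more than once (.+).
What is [[:alpha:]]?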
From the Regex: Regular Expressions As Used In R documentation page:
Certain named classes of characters are predefined.
[:alnum:]
Alphanumeric characters: [:alpha:] and [:digit:].
[:alpha:]
Alphabetic characters: [:lower:] and [:upper:].
[:blank:]
Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space.
[:cntrl:]
Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In another character set, these are the equivalent characters, if any.
[:digit:]
Digits: 0 1 2 3 4 5 6 7 8 9.
[:graph:]
Graphical characters: [:alnum:] and [:punct:].
[:lower:]
Lower-case letters in the current locale.
[:print:]
Printable characters: [:alnum:], [:punct:] and space.
[:punct:]
Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
[:space:]
Space characters: tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters.
[:upper:]
Upper-case letters in the current locale.
[:xdigit:]
Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.
Q. Ok, but why the double square brackets, like in: [[:alpha:]]?
A. Because [ and ] in [:alpha:] are simply a part of the predefined name, and we still want to inform the regex engine that we mean: a character class.
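The role of the outer brackets becomes clearer when several predefined names are placed in one class. A minimal sketch, with made-up user names and variable names:

```r
userIDs <- c("Maria0001449", "Maria_01")
# one character class built from two predefined names: letters or digits
onlyAlnum <- grepl("^[[:alpha:][:digit:]]+$", userIDs)
# the predefined [:alnum:] name covers the same characters
sameThing <- grepl("^[[:alnum:]]+$", userIDs)
print(onlyAlnum)
print(sameThing)
```

The outer [ and ] delimit the class, and each [:name:] inside it contributes its characters; "Maria_01" fails both checks because of the underscore.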
gsub()
Did we mention gsub() in the past? I think we already did:
string <- "New York City"
gsub("New", "Old", string)
[1] "Old York City"
Ok. Now what if I would like to change each occurrence of #err to an empty string "" in the following:
string <- "someDatabas#erreRecordGoneWrong"
gsub("#err", "", string)
[1] "someDatabaseRecordGoneWrong"
Great. Now we know how to delete things that we do not need in strings.
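Note that gsub() replaces every occurrence of the pattern, while its sibling sub() in base R replaces only the first one. A short sketch with a made-up string that contains #err twice:

```r
string <- "some#errData#errbase"
# sub() stops after the first match
firstOnly <- sub("#err", "", string)
# gsub() keeps going until all matches are replaced
everywhere <- gsub("#err", "", string)
print(firstOnly)
print(everywhere)
```

When the goal is to clean a string completely, gsub() is usually the one to reach for.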
Now, the backreferences in regex with gsub(). Imagine that we are facing the following situation:
strings <- c("NewYork", "NewAmsterdam", "NewBelgrade")
print(strings)
[1] "NewYork" "NewAmsterdam" "NewBelgrade"
Obviously, a set of typos, all missing a white space, which can be fixed in the following way:
gsub("(New)", "\\1 ", strings)
[1] "New York" "New Amsterdam" "New Belgrade"
\\1 in this example is a backreference which refers to the first parenthesized expression in the pattern "(New)"; it is replaced by the matched text itself concatenated with a white space - "\\1 " in gsub()! Let’s elaborate; in gsub(pattern, replacement, x) we have used "(New)" as the pattern and "\\1 " as the replacement, where strings played the role of x: the replacement used \\1 as a backreference to (New) in the pattern argument.
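A pattern can hold more than one parenthesized expression, in which case \\1, \\2, and so on refer to them in order. The following sketch is an illustration only: it assumes that the missing space always sits between a lower-case and an upper-case letter, and the second example string is made up:

```r
strings <- c("NewYork", "NewAmsterdam", "NewBelgrade")
# \\1 is the lower-case letter, \\2 the upper-case letter that follows it
spaced <- gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", strings)
print(spaced)

# backreferences can also reorder what was matched
swapped <- sub("^([[:alpha:]]+)\\.([[:alpha:]]+)$", "\\2, \\1", "Goran.Milovanovic")
print(swapped)
```

The first call does not need to know that the strings begin with "New", which makes it a bit more general than the "(New)" pattern.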
Once again: Gaston Sanchez’s “Handling and Processing Strings in R” - the chances you will ever need more than what’s covered in this text-book are slim.
Regular Expressions: go pro. Regular-Expressions.info is a well known learning resource. In order to figure out the specific regex standard used in R: Regular Expressions as used in R. This section of Regular-Expressions.info is on regex in R specifically.
R Markdown is what I have used to produce this beautiful Notebook. We will learn more about it near the end of the course, but if you already feel ready to dive deep, here’s a book: R Markdown: The Definitive Guide, Yihui Xie, J. J. Allaire, Garrett Grolemund.
Goran S. Milovanović
DataKolektiv, 2020/21
contact: goran.milovanovic@datakolektiv.com
License: GPLv3 This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.