class: center, middle, inverse, title-slide # Super fun times with strings! ### (
@RLadiesLancs
) ### 2019-10-02 --- layout: true <div class = "rladies-header"> <span class="social"><table><tr><td><img src="images/twitter.gif"/></td><td> @RLadiesLancs</td></tr></table></span> </div> --- # Tonight * **stringr** package * An introduction to regular expressions (Regex) ### Slides 👨🏫 [github.com/rladies](https://github.com/rladies/) ### Book chapter 📖 [r4ds.had.co.nz](https://r4ds.had.co.nz/transform.html) ### Slack 💬 [bit.ly/R4DSslack](https://bit.ly/R4DSslack ) --- # Thanks to our sponsors 🥨 <img src="images/jr.png" width="550px" style="display: block; margin: auto;" /> --- # stringr To access the datasets, help pages, and functions that we will use tonight, load the stringr by running this code: ```r install.packages("stringr") ``` ```r library("stringr") ``` <img src="images/stringr.png" width="250px" style="display: block; margin: auto;" /> --- # **stringr** - stringr manipulation * Handling text in data sets can be challenging * Typical problems include * Extra whitespace * Incorrectly capitalised names * Poor spelling * **stringr** has lots of tools to deal with strings * RStudio has provided a [cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/strings.pdf) --- # String length We first need to load **stringr** and we also need basic string to practice on ```r library("tidyverse") review = "The food was delicious" ``` * We can find the length of a string with `str_length()` ```r str_length(review) ``` ``` ## [1] 22 ``` * Note that this function counts whitespace --- # Subsetting strings The function `str_sub()` gives us a handy way to extract characters ```r str_sub(review, 5, 8) ``` ``` ## [1] "food" ``` --- # Subsetting strings * Positive integers count from the left * Negative integers count from the right ```r str_sub(review, -9, -1) ``` ``` ## [1] "delicious" ``` We can also use this to modify strings ```r str_sub(review, -9, -1) = "awful" review ``` ``` ## [1] "The food was awful" ``` --- # Replacing strings * Use `str_replace_all()` to match and replace patterns ```r str_replace_all(review, pattern = "food", replacement = "service") ``` ``` ## [1] "The service was awful" ``` --- # Detecting strings * If we have a vector of strings, we can look for patterns within each string ```r review = c("The food was delicious", "The staff were unhelpful", "The service was ok") str_detect(review, "delicious") ``` ``` ## [1] TRUE FALSE FALSE ``` ```r str_which(review, "delicious") ``` ``` ## [1] 1 ``` ```r str_subset(review, "delicious") ``` ``` ## [1] "The food was delicious" ``` --- # Dealing with whitespace * Whitespace can be a pain when working with strings. ```r names = c("John", "Abby", " Thomas", " Zak") str_sort(names) ``` ``` ## [1] " Thomas" " Zak" "Abby" "John" ``` * The string has been sorted incorrectly due to the whitespace. * We can use `str_trim()` to remove any leading or trailing whitespace from strings. ```r names %>% str_trim() %>% str_sort() ``` ``` ## [1] "Abby" "John" "Thomas" "Zak" ``` --- # Squish! * We can also trim excessive whitespace within a string, using `str_squish()`. ```r bond = c(" James Bond ") str_squish(bond) ``` ``` ## [1] "James Bond" ``` * Notice this also trims leading and trailing whitespace. --- # Case transformation * `str_to_upper()` converts all to upper * `str_to_lower()` converts all to lower * `str_to_title()` capitalises the first character of each word ```r str_to_upper(names) ``` ``` ## [1] "JOHN" "ABBY" " THOMAS" " ZAK" ``` ```r str_to_lower(names) ``` ``` ## [1] "john" "abby" " thomas" " zak" ``` ```r bad_names = c("JONATHON butterwood", "colin BATTERSEY") str_to_title(bad_names) ``` ``` ## [1] "Jonathon Butterwood" "Colin Battersey" ``` --- # Exercises 1. Set `longword = "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch"` - Use str_subset to extract the first 16 letters - What is the 10th letter? Use `str_length()`to extract the middle two character(s) 2. Using `stringr::sentences`, answer the following. * How many sentences contain the word "green"? * Which sentence has the most characters? Hint: look at the help for `which.max()` * Convert all the sentences to uppercase. * What does `str_count()` do? * Replace all fullstops with exclaimation marks. Hint: you'll need to use `\\.` and `\\!` ### Slies 👨🏫 [github.com/rladies](https://github.com/rladies/) --- # Regular Expressions * General language to describe patterns in strings. * Practice with `str_view()` ```r x <- c("apple", "banana", "pear") str_view(x, "an") ```
--- # Any character `.` The next step up in complexity is `.`, which matches any character (except a newline): ```r str_view(x, ".a.") ```
--- # Escape! * How do we search for a literal dot `.` ? * `\` is the character to escape * In R, we need to type two backslashes in a string. * Therefore to search for a dot we need to use `\\.` For example, to find the phrase `a.c` ```r str_view(c("abc", "a.c", "bef"), "a\\.c") ```
--- # Escape escape! * How do we search for a backslash `\`? * Well we need to escape it. * `\\\\` ```r x <- c("apple", "banana", "pear", "plum\\peach") str_view(x, "\\\\") ```
--- # Anchors ⚓ * By default, regular expressions will match any part of a string. * We can anchor the regular expression to match from the start or end of the string. * `^` to match the start of the string. ```r x <- c("apple", "banana", "pear") str_view(x, "^a") ```
--- # Anchors ⚓ * By default, regular expressions will match any part of a string. * We can anchor the regular expression to match from the start or end of the string. * `$` to match the end of the string. ```r x <- c("apple", "banana", "pear") str_view(x, "a$") ```
--- # Anchors ⚓ * Putting it all together * Use both `^` and `$` ```r x <- c("apple", "banana", "pear") str_view(x, "^apple$") ```
--- # Character classes There are a number of special patterns that match more than one character.You’ve already seen `.`, which matches any character apart from a newline. * `\d` or `[0-9]`: matches any digit. * `[a-z]`: any letter of the alphabet. * `\s:` matches any whitespace (e.g. space, tab, newline). * `[abc]`: matches a, b, or c. * `[^abc]`: matches anything except a, b, or c. Remember, to create a regular expression containing `\d` or `\s`, you’ll need to escape the `\` for the string, so you’ll type `\\d` or `\\s`. ```r x <- c("apple1", "apple2", "apple", "banana") str_view(x, "apple[0-9]") ```
--- # Repetition We can also look for how many times a pattern matches: * ?: 0 or 1 * +: 1 or more * *: 0 or more * {n}: exactly n * {n, }: n or more * {, m}: at most m * {n, m}: between n and m ```r x <- c("colour", "color", "colours") str_view(x, "colo(u)?r") ```
--- # Grouping and backreferences ```r str_view(fruit, "(..)\\1", match = TRUE) ```
--- # Exercises Using `stringr::words`, create regular expressions that find all words that: 1. Start with “y”. 2. End with “x” 3. Are exactly three letters long. (Don’t cheat by using `str_length()`!) 4. Have seven letters or more. 5. Start with a vowel. 6. That only contain consonants. (Hint: thinking about matching “not” - vowels.) 7. End with ed, but not with eed. 8. End with ing or ise. 9. Contains consecutive triple vowels. (e.g "iou") You might want to use the match argument to `str_view()` to show only the matching or non-matching words. Or use `str_subset()`. ```r str_view(words, pattern = "^ab", match = TRUE) ``` or ```r str_subset(words, "^ab") ```