Guillaume Abgrall
2019-04-25
- Whats Covered
- Aditional Resources
- Mutating Joins
- Welcome to the course!
- Keys
- Primary Keys
- Secondary Keys
- Multi-variable keys
- Joins
- A basic join
- A second join
- A right join
- Variations on joins
- Inner joins and full joins
- Pipes
- Practice with pipes and joins
- Choose your joins
- Session info
Whats Covered
- Mutating joins
- Filtering joins and set operations
- Assembling data
- Advanced joining
- Case Study
Aditional Resources
Welcome to the course!
- Advantages to
dplyr
over base Rmerge
function for joing data- dplyr always preserves the row order
- dplyr has much more intuitive syntax
- dplyr can be applied to databases, or spark
- dplyr is a front end language for maniulating data that can be converted to multiple backends like SQL or spark.
- This is pretty useful. Everything we cover in this class could work on a spark dataframe or a sql table
Keys
- The Primary key needs to be unique in a table
- The foreign key in the second table can be duplicated
- second table will be matched to the primary table based on the primary key
- The primary key may be one, two or even more columns in the table
Primary Keys
- What is the primary key in artists dataset?
name
artists0 %>% head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
name | instrument |
---|---|
Jimmy Buffett | Guitar |
George Harrison | Guitar |
Mick Jagger | Vocals |
Tom Jones | Vocals |
Davy Jones | Vocals |
John Lennon | Guitar |
Paul McCartney | Bass |
Jimmy Page | Guitar |
Joe Perry | Guitar |
Elvis Presley | Vocals |
Secondary Keys
- Which datasets have a primary key that matches artists\(name? + `bands\)name
+
songs$writer`
albums0 %>% head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
name | band | year |
---|---|---|
A Hard Day’s Night | The Beatles | 1964 |
Magical Mystery Tour | The Beatles | 1967 |
Beggar’s Banquet | The Rolling Stones | 1968 |
Abbey Road | The Beatles | 1969 |
Led Zeppelin IV | Led Zeppelin | 1971 |
The Dark Side of the Moon | Pink Floyd | 1973 |
Aerosmith | Aerosmith | 1973 |
Rumours | Fleetwood Mac | 1977 |
Hotel California | Eagles | 1982 |
bands0 %>% head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
name | band |
---|---|
John Bonham | Led Zeppelin |
John Paul Jones | Led Zeppelin |
Jimmy Page | Led Zeppelin |
Robert Plant | Led Zeppelin |
George Harrison | The Beatles |
John Lennon | The Beatles |
Paul McCartney | The Beatles |
Ringo Starr | The Beatles |
Jimmy Buffett | The Coral Reefers |
Mick Jagger | The Rolling Stones |
songs0 %>% head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
song | album | writer |
---|---|---|
Come Together | Abbey Road | John Lennon |
Dream On | Aerosmith | Steven Tyler |
Hello, Goodbye | Magical Mystery Tour | Paul McCartney |
It’s Not Unusual | Along Came Jones | Tom Jones |
Multi-variable keys
- What is the primary key in artists?
- The combination of
first
andlast
- The combination of
artists %>% head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
first | last | instrument |
---|---|---|
Jimmy | Buffett | Guitar |
George | Harrison | Guitar |
Mick | Jagger | Vocals |
Tom | Jones | Vocals |
Davy | Jones | Vocals |
John | Lennon | Guitar |
Paul | McCartney | Bass |
Jimmy | Page | Guitar |
Joe | Perry | Guitar |
Elvis | Presley | Vocals |
The combination of first and last uniquely identifies each person in the dataset, just as the combination of first and last names tend to uniquely identify people in real life.
Joins
- Joins in dplyr will work on data frames, tibbles (tbl_df), and tbl references
A basic join
As Garrett mentioned in the video, left_join()
is the basic join function in dplyr
. You can use it whenever you want to augment a data frame with information from another data frame.
For example, left_join(x, y)
joins y
to x
. The second dataset you specify is joined to the first dataset.
# Complete the code to join artists to bandsbands2 <- left_join(bands, artists, by = c('first','last'))# Examine the resultsbands2 %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
first | last | band | instrument |
---|---|---|---|
John | Bonham | Led Zeppelin | NA |
John Paul | Jones | Led Zeppelin | NA |
Jimmy | Page | Led Zeppelin | Guitar |
Robert | Plant | Led Zeppelin | NA |
George | Harrison | The Beatles | Guitar |
John | Lennon | The Beatles | Guitar |
Paul | McCartney | The Beatles | Bass |
Ringo | Starr | The Beatles | Drums |
Jimmy | Buffett | The Coral Reefers | Guitar |
Mick | Jagger | The Rolling Stones | Vocals |
Keith | Richards | The Rolling Stones | Guitar |
Charlie | Watts | The Rolling Stones | NA |
Ronnie | Wood | The Rolling Stones | NA |
A second join
You should always check the output of your joins. Just because a join worked, doesn’t mean that it worked as you expected.
For example, the code in the editor joins the same two datasets that you joined in the previous exercise, but it returns a different result. Can you tell what went wrong?
left_join(bands, artists, by = "first") %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
first | last.x | band | last.y | instrument |
---|---|---|---|---|
John | Bonham | Led Zeppelin | Lennon | Guitar |
John Paul | Jones | Led Zeppelin | NA | NA |
Jimmy | Page | Led Zeppelin | Buffett | Guitar |
Jimmy | Page | Led Zeppelin | Page | Guitar |
Robert | Plant | Led Zeppelin | NA | NA |
George | Harrison | The Beatles | Harrison | Guitar |
John | Lennon | The Beatles | Lennon | Guitar |
Paul | McCartney | The Beatles | McCartney | Bass |
Paul | McCartney | The Beatles | Simon | Guitar |
Ringo | Starr | The Beatles | Starr | Drums |
Jimmy | Buffett | The Coral Reefers | Buffett | Guitar |
Jimmy | Buffett | The Coral Reefers | Page | Guitar |
Mick | Jagger | The Rolling Stones | Jagger | Vocals |
Keith | Richards | The Rolling Stones | Richards | Guitar |
Charlie | Watts | The Rolling Stones | NA | NA |
Ronnie | Wood | The Rolling Stones | NA | NA |
# Fix the code to recreate bands3left_join(bands, artists, by = c("first","last")) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
first | last | band | instrument |
---|---|---|---|
John | Bonham | Led Zeppelin | NA |
John Paul | Jones | Led Zeppelin | NA |
Jimmy | Page | Led Zeppelin | Guitar |
Robert | Plant | Led Zeppelin | NA |
George | Harrison | The Beatles | Guitar |
John | Lennon | The Beatles | Guitar |
Paul | McCartney | The Beatles | Bass |
Ringo | Starr | The Beatles | Drums |
Jimmy | Buffett | The Coral Reefers | Guitar |
Mick | Jagger | The Rolling Stones | Vocals |
Keith | Richards | The Rolling Stones | Guitar |
Charlie | Watts | The Rolling Stones | NA |
Ronnie | Wood | The Rolling Stones | NA |
- Notice how the
last
column is pulled in aslast.x
andlast.y
if not included in the join- And how all combination of rows are created for each name that is duplicated, e.g.john
- The easiest mistake when joining datasets is to use an incorrect combination of keys.
A right join
There is more than one way to execute a left join. Knowing multiple methods will make you a more versatile data scientist, especially as you try to fit joins into pipes created with %>%
.
In this exercise, you’ll recreate bands2 once more, but this time without using a left_join()
.
# Finish the code below to recreate bands3 with a right joinbands2 <- left_join(bands, artists, by = c("first", "last"))bands3 <- right_join(artists, bands, by = c("first", "last"))# Check that bands3 is equal to bands2setequal(bands2, bands3)
## [1] TRUE
left_join()
is equivalent to right_join()
with the order of the datasets reversed. Notice that each used the same by
argument.
Variations on joins
left_join
- prioritizes left datasetright_join
- prioritizes right datasetinner_join
- only retains rows in both datasetsfull_join
- retains all rows- Use
%>%
(pipes) to string together these joins
Inner joins and full joins
You may have noticed that some of the songs
in songs
correspond to some of the albums
in albums
. Suppose you want a new dataset that contains all of the songs
for which you have data from both albums
and songs
. How would you make it?
The artists
and bands
datasets also share some information. What if you want to join these two datasets in such a way that you retain all of the information available in both tables, without throwing anything away?
You can think of inner joins
as the most strict type of join: they only retain observations that appear in both datasets. In contrast, full joins
are the most permissive type of join: they return all of the data that appears in both datasets (often resulting in many missing values).
Recall that, *_join(x, y)
joins y
to x
. The second dataset you specify is joined to the first dataset.
# Join albums to songs using inner_join()songs %>% head() %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
song | album | first | last |
---|---|---|---|
Come Together | Abbey Road | John | Lennon |
Dream On | Aerosmith | Steven | Tyler |
Hello, Goodbye | Magical Mystery Tour | Paul | McCartney |
It’s Not Unusual | Along Came Jones | Tom | Jones |
albums %>% head() %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
album | band | year |
---|---|---|
A Hard Day’s Night | The Beatles | 1964 |
Magical Mystery Tour | The Beatles | 1967 |
Beggar’s Banquet | The Rolling Stones | 1968 |
Abbey Road | The Beatles | 1969 |
Led Zeppelin IV | Led Zeppelin | 1971 |
The Dark Side of the Moon | Pink Floyd | 1973 |
inner_join(songs, albums, by = "album") %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
song | album | first | last | band | year |
---|---|---|---|---|---|
Come Together | Abbey Road | John | Lennon | The Beatles | 1969 |
Dream On | Aerosmith | Steven | Tyler | Aerosmith | 1973 |
Hello, Goodbye | Magical Mystery Tour | Paul | McCartney | The Beatles | 1967 |
# Join bands to artists using full_join()artists %>% head() %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
first | last | instrument |
---|---|---|
Jimmy | Buffett | Guitar |
George | Harrison | Guitar |
Mick | Jagger | Vocals |
Tom | Jones | Vocals |
Davy | Jones | Vocals |
John | Lennon | Guitar |
bands %>% head() %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
first | last | band |
---|---|---|
John | Bonham | Led Zeppelin |
John Paul | Jones | Led Zeppelin |
Jimmy | Page | Led Zeppelin |
Robert | Plant | Led Zeppelin |
George | Harrison | The Beatles |
John | Lennon | The Beatles |
full_join(artists, bands, by = c("first","last")) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
first | last | instrument | band |
---|---|---|---|
Jimmy | Buffett | Guitar | The Coral Reefers |
George | Harrison | Guitar | The Beatles |
Mick | Jagger | Vocals | The Rolling Stones |
Tom | Jones | Vocals | NA |
Davy | Jones | Vocals | NA |
John | Lennon | Guitar | The Beatles |
Paul | McCartney | Bass | The Beatles |
Jimmy | Page | Guitar | Led Zeppelin |
Joe | Perry | Guitar | NA |
Elvis | Presley | Vocals | NA |
Keith | Richards | Guitar | The Rolling Stones |
Paul | Simon | Guitar | NA |
Ringo | Starr | Drums | The Beatles |
Joe | Walsh | Guitar | NA |
Brian | Wilson | Vocals | NA |
Nancy | Wilson | Vocals | NA |
John | Bonham | NA | Led Zeppelin |
John Paul | Jones | NA | Led Zeppelin |
Robert | Plant | NA | Led Zeppelin |
Charlie | Watts | NA | The Rolling Stones |
Ronnie | Wood | NA | The Rolling Stones |
Pipes
You can combine dplyr
functions together with the pipe operator, %>%
, to build up an analysis step-by-step. %>%
takes the result of the code that comes before it and “pipes” it into the function that comes after it as the first argument of the function.
So for example, the two pieces of code below do the same thing:
full_join(artists, bands, by = c("first", "last")) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")artists %>% full_join(bands, by = c("first", "last"))
Pipes are so efficient for multi-step analysis that you will use them for the remainder of the exercises in this course.
# Find guitarists in bands dataset (don't change)temp <- left_join(bands, artists, by = c("first", "last"))temp <- filter(temp, instrument == "Guitar")select(temp, first, last, band)
## # A tibble: 5 x 3## first last band ## <chr> <chr> <chr> ## 1 Jimmy Page Led Zeppelin ## 2 George Harrison The Beatles ## 3 John Lennon The Beatles ## 4 Jimmy Buffett The Coral Reefers ## 5 Keith Richards The Rolling Stones
# Reproduce code above using pipesbands %>% left_join(artists, by = c("first", "last")) %>% filter(instrument == "Guitar") %>% select(first, last, band)
## # A tibble: 5 x 3## first last band ## <chr> <chr> <chr> ## 1 Jimmy Page Led Zeppelin ## 2 George Harrison The Beatles ## 3 John Lennon The Beatles ## 4 Jimmy Buffett The Coral Reefers ## 5 Keith Richards The Rolling Stones
Practice with pipes and joins
We’ve created a data frame for you called goal
. Your mission, if you choose to accept it, is to create a dataset that’s identical to goal
.
# Examine the contents of the goal datasetgoal %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
first | last | instrument | band | song | album |
---|---|---|---|---|---|
Tom | Jones | Vocals | NA | It’s Not Unusual | Along Came Jones |
John | Lennon | Guitar | The Beatles | Come Together | Abbey Road |
Paul | McCartney | Bass | The Beatles | Hello, Goodbye | Magical Mystery Tour |
# Create goal2 using full_join() and inner_join() goal2 <- artists %>% full_join(bands, by = c("first","last")) %>% inner_join(songs, by = c("first","last")) # Check that goal and goal2 are the samesetequal(goal, goal2)
## [1] TRUE
Remember that the order of the joins in your pipe determines the output of the pipe.
Choose your joins
You’re getting the hang of pipes now! They are a very useful way to combine multiple joins to make a single dataset.
# Create one table that combines all informationartists %>% full_join(bands, by = c("first","last")) %>% full_join(songs, by = c("first","last")) %>% full_join(albums, by = c("album", "band")) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
first | last | instrument | band | song | album | year |
---|---|---|---|---|---|---|
Jimmy | Buffett | Guitar | The Coral Reefers | NA | NA | NA |
George | Harrison | Guitar | The Beatles | NA | NA | NA |
Mick | Jagger | Vocals | The Rolling Stones | NA | NA | NA |
Tom | Jones | Vocals | NA | It’s Not Unusual | Along Came Jones | NA |
Davy | Jones | Vocals | NA | NA | NA | NA |
John | Lennon | Guitar | The Beatles | Come Together | Abbey Road | 1969 |
Paul | McCartney | Bass | The Beatles | Hello, Goodbye | Magical Mystery Tour | 1967 |
Jimmy | Page | Guitar | Led Zeppelin | NA | NA | NA |
Joe | Perry | Guitar | NA | NA | NA | NA |
Elvis | Presley | Vocals | NA | NA | NA | NA |
Keith | Richards | Guitar | The Rolling Stones | NA | NA | NA |
Paul | Simon | Guitar | NA | NA | NA | NA |
Ringo | Starr | Drums | The Beatles | NA | NA | NA |
Joe | Walsh | Guitar | NA | NA | NA | NA |
Brian | Wilson | Vocals | NA | NA | NA | NA |
Nancy | Wilson | Vocals | NA | NA | NA | NA |
John | Bonham | NA | Led Zeppelin | NA | NA | NA |
John Paul | Jones | NA | Led Zeppelin | NA | NA | NA |
Robert | Plant | NA | Led Zeppelin | NA | NA | NA |
Charlie | Watts | NA | The Rolling Stones | NA | NA | NA |
Ronnie | Wood | NA | The Rolling Stones | NA | NA | NA |
Steven | Tyler | NA | NA | Dream On | Aerosmith | NA |
NA | NA | NA | The Beatles | NA | A Hard Day’s Night | 1964 |
NA | NA | NA | The Rolling Stones | NA | Beggar’s Banquet | 1968 |
NA | NA | NA | Led Zeppelin | NA | Led Zeppelin IV | 1971 |
NA | NA | NA | Pink Floyd | NA | The Dark Side of the Moon | 1973 |
NA | NA | NA | Aerosmith | NA | Aerosmith | 1973 |
NA | NA | NA | Fleetwood Mac | NA | Rumours | 1977 |
NA | NA | NA | Eagles | NA | Hotel California | 1982 |
Session info
sessionInfo()
## R version 3.5.2 (2018-12-20)## Platform: x86_64-w64-mingw32/x64 (64-bit)## Running under: Windows 10 x64 (build 16299)## ## Matrix products: default## ## locale:## [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252 ## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C ## [5] LC_TIME=German_Switzerland.1252 ## ## attached base packages:## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages:## [1] ggplot2_3.1.0 dplyr_0.8.0.1 gapminder_0.3.0 kableExtra_1.0.1## [5] knitr_1.21 ## ## loaded via a namespace (and not attached):## [1] Rcpp_1.0.0 highr_0.7 plyr_1.8.4 ## [4] pillar_1.3.1 compiler_3.5.2 prettydoc_0.2.1 ## [7] tools_3.5.2 digest_0.6.18 gtable_0.2.0 ## [10] evaluate_0.12 tibble_2.0.1 viridisLite_0.3.0## [13] pkgconfig_2.0.2 rlang_0.3.1 cli_1.0.1 ## [16] rstudioapi_0.9.0 yaml_2.2.0 xfun_0.4 ## [19] withr_2.1.2 httr_1.4.0 stringr_1.4.0 ## [22] xml2_1.2.0 hms_0.4.2 webshot_0.5.1 ## [25] grid_3.5.2 tidyselect_0.2.5 glue_1.3.0 ## [28] R6_2.4.0 fansi_0.4.0 rmarkdown_1.11 ## [31] readr_1.3.1 purrr_0.3.0 magrittr_1.5 ## [34] codetools_0.2-15 scales_1.0.0 htmltools_0.3.6 ## [37] assertthat_0.2.0 rvest_0.3.2 colorspace_1.4-0 ## [40] utf8_1.1.4 stringi_1.3.1 lazyeval_0.2.1 ## [43] munsell_0.5.0 crayon_1.3.4
A work by Guillaume Abgrall
guillaume.abgrall@hotmail.fr