Joining Data in R with dplyr (2024)

Guillaume Abgrall

2019-04-25

  • Whats Covered
  • Aditional Resources
  • Mutating Joins
    • Welcome to the course!
    • Keys
      • Primary Keys
      • Secondary Keys
      • Multi-variable keys
    • Joins
      • A basic join
      • A second join
      • A right join
    • Variations on joins
      • Inner joins and full joins
      • Pipes
      • Practice with pipes and joins
      • Choose your joins
    • Session info

Whats Covered

  • Mutating joins
  • Filtering joins and set operations
  • Assembling data
  • Advanced joining
  • Case Study

Welcome to the course!

  • Advantages to dplyr over base R merge function for joing data
    • dplyr always preserves the row order
    • dplyr has much more intuitive syntax
    • dplyr can be applied to databases, or spark
  • dplyr is a front end language for maniulating data that can be converted to multiple backends like SQL or spark.
    • This is pretty useful. Everything we cover in this class could work on a spark dataframe or a sql table

Keys

  • The Primary key needs to be unique in a table
  • The foreign key in the second table can be duplicated
  • second table will be matched to the primary table based on the primary key
  • The primary key may be one, two or even more columns in the table

Primary Keys

  • What is the primary key in artists dataset?
    • name
artists0 %>%  head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
nameinstrument
Jimmy BuffettGuitar
George HarrisonGuitar
Mick JaggerVocals
Tom JonesVocals
Davy JonesVocals
John LennonGuitar
Paul McCartneyBass
Jimmy PageGuitar
Joe PerryGuitar
Elvis PresleyVocals

Secondary Keys

  • Which datasets have a primary key that matches artists\(name? + `bands\)name+songs$writer`
albums0 %>%  head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
namebandyear
A Hard Day’s NightThe Beatles1964
Magical Mystery TourThe Beatles1967
Beggar’s BanquetThe Rolling Stones1968
Abbey RoadThe Beatles1969
Led Zeppelin IVLed Zeppelin1971
The Dark Side of the MoonPink Floyd1973
AerosmithAerosmith1973
RumoursFleetwood Mac1977
Hotel CaliforniaEagles1982
bands0 %>%  head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
nameband
John BonhamLed Zeppelin
John Paul JonesLed Zeppelin
Jimmy PageLed Zeppelin
Robert PlantLed Zeppelin
George HarrisonThe Beatles
John LennonThe Beatles
Paul McCartneyThe Beatles
Ringo StarrThe Beatles
Jimmy BuffettThe Coral Reefers
Mick JaggerThe Rolling Stones
songs0 %>%  head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
songalbumwriter
Come TogetherAbbey RoadJohn Lennon
Dream OnAerosmithSteven Tyler
Hello, GoodbyeMagical Mystery TourPaul McCartney
It’s Not UnusualAlong Came JonesTom Jones

Multi-variable keys

  • What is the primary key in artists?
    • The combination of first and last
artists %>%  head(10) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
firstlastinstrument
JimmyBuffettGuitar
GeorgeHarrisonGuitar
MickJaggerVocals
TomJonesVocals
DavyJonesVocals
JohnLennonGuitar
PaulMcCartneyBass
JimmyPageGuitar
JoePerryGuitar
ElvisPresleyVocals

The combination of first and last uniquely identifies each person in the dataset, just as the combination of first and last names tend to uniquely identify people in real life.

Joins

  • Joins in dplyr will work on data frames, tibbles (tbl_df), and tbl references

    Joining Data in R with dplyr (1)

A basic join

As Garrett mentioned in the video, left_join() is the basic join function in dplyr. You can use it whenever you want to augment a data frame with information from another data frame.

For example, left_join(x, y) joins y to x. The second dataset you specify is joined to the first dataset.

# Complete the code to join artists to bandsbands2 <- left_join(bands, artists, by = c('first','last'))# Examine the resultsbands2 %>%  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
firstlastbandinstrument
JohnBonhamLed ZeppelinNA
John PaulJonesLed ZeppelinNA
JimmyPageLed ZeppelinGuitar
RobertPlantLed ZeppelinNA
GeorgeHarrisonThe BeatlesGuitar
JohnLennonThe BeatlesGuitar
PaulMcCartneyThe BeatlesBass
RingoStarrThe BeatlesDrums
JimmyBuffettThe Coral ReefersGuitar
MickJaggerThe Rolling StonesVocals
KeithRichardsThe Rolling StonesGuitar
CharlieWattsThe Rolling StonesNA
RonnieWoodThe Rolling StonesNA

A second join

You should always check the output of your joins. Just because a join worked, doesn’t mean that it worked as you expected.

For example, the code in the editor joins the same two datasets that you joined in the previous exercise, but it returns a different result. Can you tell what went wrong?

left_join(bands, artists, by = "first") %>%  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
firstlast.xbandlast.yinstrument
JohnBonhamLed ZeppelinLennonGuitar
John PaulJonesLed ZeppelinNANA
JimmyPageLed ZeppelinBuffettGuitar
JimmyPageLed ZeppelinPageGuitar
RobertPlantLed ZeppelinNANA
GeorgeHarrisonThe BeatlesHarrisonGuitar
JohnLennonThe BeatlesLennonGuitar
PaulMcCartneyThe BeatlesMcCartneyBass
PaulMcCartneyThe BeatlesSimonGuitar
RingoStarrThe BeatlesStarrDrums
JimmyBuffettThe Coral ReefersBuffettGuitar
JimmyBuffettThe Coral ReefersPageGuitar
MickJaggerThe Rolling StonesJaggerVocals
KeithRichardsThe Rolling StonesRichardsGuitar
CharlieWattsThe Rolling StonesNANA
RonnieWoodThe Rolling StonesNANA
# Fix the code to recreate bands3left_join(bands, artists, by = c("first","last")) %>%  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
firstlastbandinstrument
JohnBonhamLed ZeppelinNA
John PaulJonesLed ZeppelinNA
JimmyPageLed ZeppelinGuitar
RobertPlantLed ZeppelinNA
GeorgeHarrisonThe BeatlesGuitar
JohnLennonThe BeatlesGuitar
PaulMcCartneyThe BeatlesBass
RingoStarrThe BeatlesDrums
JimmyBuffettThe Coral ReefersGuitar
MickJaggerThe Rolling StonesVocals
KeithRichardsThe Rolling StonesGuitar
CharlieWattsThe Rolling StonesNA
RonnieWoodThe Rolling StonesNA
  • Notice how the last column is pulled in as last.x and last.y if not included in the join
    • And how all combination of rows are created for each name that is duplicated, e.g.john
  • The easiest mistake when joining datasets is to use an incorrect combination of keys.

A right join

There is more than one way to execute a left join. Knowing multiple methods will make you a more versatile data scientist, especially as you try to fit joins into pipes created with %>%.

In this exercise, you’ll recreate bands2 once more, but this time without using a left_join().

# Finish the code below to recreate bands3 with a right joinbands2 <- left_join(bands, artists, by = c("first", "last"))bands3 <- right_join(artists, bands, by = c("first", "last"))# Check that bands3 is equal to bands2setequal(bands2, bands3)
## [1] TRUE

left_join() is equivalent to right_join() with the order of the datasets reversed. Notice that each used the same by argument.

Variations on joins

  • left_join - prioritizes left dataset
  • right_join - prioritizes right dataset
  • inner_join - only retains rows in both datasets
  • full_join - retains all rows
  • Use %>% (pipes) to string together these joins

Inner joins and full joins

You may have noticed that some of the songs in songs correspond to some of the albums in albums. Suppose you want a new dataset that contains all of the songs for which you have data from both albums and songs. How would you make it?

The artists and bandsdatasets also share some information. What if you want to join these two datasets in such a way that you retain all of the information available in both tables, without throwing anything away?

You can think of inner joins as the most strict type of join: they only retain observations that appear in both datasets. In contrast, full joins are the most permissive type of join: they return all of the data that appears in both datasets (often resulting in many missing values).

Recall that, *_join(x, y) joins y to x. The second dataset you specify is joined to the first dataset.

# Join albums to songs using inner_join()songs %>%  head() %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
songalbumfirstlast
Come TogetherAbbey RoadJohnLennon
Dream OnAerosmithStevenTyler
Hello, GoodbyeMagical Mystery TourPaulMcCartney
It’s Not UnusualAlong Came JonesTomJones
albums %>%  head() %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
albumbandyear
A Hard Day’s NightThe Beatles1964
Magical Mystery TourThe Beatles1967
Beggar’s BanquetThe Rolling Stones1968
Abbey RoadThe Beatles1969
Led Zeppelin IVLed Zeppelin1971
The Dark Side of the MoonPink Floyd1973
inner_join(songs, albums, by = "album") %>%  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
songalbumfirstlastbandyear
Come TogetherAbbey RoadJohnLennonThe Beatles1969
Dream OnAerosmithStevenTylerAerosmith1973
Hello, GoodbyeMagical Mystery TourPaulMcCartneyThe Beatles1967
# Join bands to artists using full_join()artists %>%  head() %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
firstlastinstrument
JimmyBuffettGuitar
GeorgeHarrisonGuitar
MickJaggerVocals
TomJonesVocals
DavyJonesVocals
JohnLennonGuitar
bands %>%  head() %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
firstlastband
JohnBonhamLed Zeppelin
John PaulJonesLed Zeppelin
JimmyPageLed Zeppelin
RobertPlantLed Zeppelin
GeorgeHarrisonThe Beatles
JohnLennonThe Beatles
full_join(artists, bands, by = c("first","last")) %>%  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
firstlastinstrumentband
JimmyBuffettGuitarThe Coral Reefers
GeorgeHarrisonGuitarThe Beatles
MickJaggerVocalsThe Rolling Stones
TomJonesVocalsNA
DavyJonesVocalsNA
JohnLennonGuitarThe Beatles
PaulMcCartneyBassThe Beatles
JimmyPageGuitarLed Zeppelin
JoePerryGuitarNA
ElvisPresleyVocalsNA
KeithRichardsGuitarThe Rolling Stones
PaulSimonGuitarNA
RingoStarrDrumsThe Beatles
JoeWalshGuitarNA
BrianWilsonVocalsNA
NancyWilsonVocalsNA
JohnBonhamNALed Zeppelin
John PaulJonesNALed Zeppelin
RobertPlantNALed Zeppelin
CharlieWattsNAThe Rolling Stones
RonnieWoodNAThe Rolling Stones

Pipes

You can combine dplyr functions together with the pipe operator, %>%, to build up an analysis step-by-step. %>% takes the result of the code that comes before it and “pipes” it into the function that comes after it as the first argument of the function.

So for example, the two pieces of code below do the same thing:

full_join(artists, bands, by = c("first", "last")) %>%  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")artists %>%  full_join(bands, by = c("first", "last"))

Pipes are so efficient for multi-step analysis that you will use them for the remainder of the exercises in this course.

# Find guitarists in bands dataset (don't change)temp <- left_join(bands, artists, by = c("first", "last"))temp <- filter(temp, instrument == "Guitar")select(temp, first, last, band)
## # A tibble: 5 x 3## first last band ## <chr> <chr> <chr> ## 1 Jimmy Page Led Zeppelin ## 2 George Harrison The Beatles ## 3 John Lennon The Beatles ## 4 Jimmy Buffett The Coral Reefers ## 5 Keith Richards The Rolling Stones
# Reproduce code above using pipesbands %>%  left_join(artists, by = c("first", "last")) %>% filter(instrument == "Guitar") %>% select(first, last, band)
## # A tibble: 5 x 3## first last band ## <chr> <chr> <chr> ## 1 Jimmy Page Led Zeppelin ## 2 George Harrison The Beatles ## 3 John Lennon The Beatles ## 4 Jimmy Buffett The Coral Reefers ## 5 Keith Richards The Rolling Stones

Practice with pipes and joins

We’ve created a data frame for you called goal. Your mission, if you choose to accept it, is to create a dataset that’s identical to goal.

# Examine the contents of the goal datasetgoal %>%  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
firstlastinstrumentbandsongalbum
TomJonesVocalsNAIt’s Not UnusualAlong Came Jones
JohnLennonGuitarThe BeatlesCome TogetherAbbey Road
PaulMcCartneyBassThe BeatlesHello, GoodbyeMagical Mystery Tour
# Create goal2 using full_join() and inner_join() goal2 <- artists %>% full_join(bands, by = c("first","last")) %>% inner_join(songs, by = c("first","last")) # Check that goal and goal2 are the samesetequal(goal, goal2)
## [1] TRUE

Remember that the order of the joins in your pipe determines the output of the pipe.

Choose your joins

You’re getting the hang of pipes now! They are a very useful way to combine multiple joins to make a single dataset.

# Create one table that combines all informationartists %>% full_join(bands, by = c("first","last")) %>% full_join(songs, by = c("first","last")) %>% full_join(albums, by = c("album", "band")) %>%  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>% row_spec(0, bold = T, color = "white", background = "#3f7689")
firstlastinstrumentbandsongalbumyear
JimmyBuffettGuitarThe Coral ReefersNANANA
GeorgeHarrisonGuitarThe BeatlesNANANA
MickJaggerVocalsThe Rolling StonesNANANA
TomJonesVocalsNAIt’s Not UnusualAlong Came JonesNA
DavyJonesVocalsNANANANA
JohnLennonGuitarThe BeatlesCome TogetherAbbey Road1969
PaulMcCartneyBassThe BeatlesHello, GoodbyeMagical Mystery Tour1967
JimmyPageGuitarLed ZeppelinNANANA
JoePerryGuitarNANANANA
ElvisPresleyVocalsNANANANA
KeithRichardsGuitarThe Rolling StonesNANANA
PaulSimonGuitarNANANANA
RingoStarrDrumsThe BeatlesNANANA
JoeWalshGuitarNANANANA
BrianWilsonVocalsNANANANA
NancyWilsonVocalsNANANANA
JohnBonhamNALed ZeppelinNANANA
John PaulJonesNALed ZeppelinNANANA
RobertPlantNALed ZeppelinNANANA
CharlieWattsNAThe Rolling StonesNANANA
RonnieWoodNAThe Rolling StonesNANANA
StevenTylerNANADream OnAerosmithNA
NANANAThe BeatlesNAA Hard Day’s Night1964
NANANAThe Rolling StonesNABeggar’s Banquet1968
NANANALed ZeppelinNALed Zeppelin IV1971
NANANAPink FloydNAThe Dark Side of the Moon1973
NANANAAerosmithNAAerosmith1973
NANANAFleetwood MacNARumours1977
NANANAEaglesNAHotel California1982

Session info

sessionInfo()
## R version 3.5.2 (2018-12-20)## Platform: x86_64-w64-mingw32/x64 (64-bit)## Running under: Windows 10 x64 (build 16299)## ## Matrix products: default## ## locale:## [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252 ## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C ## [5] LC_TIME=German_Switzerland.1252 ## ## attached base packages:## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages:## [1] ggplot2_3.1.0 dplyr_0.8.0.1 gapminder_0.3.0 kableExtra_1.0.1## [5] knitr_1.21 ## ## loaded via a namespace (and not attached):## [1] Rcpp_1.0.0 highr_0.7 plyr_1.8.4 ## [4] pillar_1.3.1 compiler_3.5.2 prettydoc_0.2.1 ## [7] tools_3.5.2 digest_0.6.18 gtable_0.2.0 ## [10] evaluate_0.12 tibble_2.0.1 viridisLite_0.3.0## [13] pkgconfig_2.0.2 rlang_0.3.1 cli_1.0.1 ## [16] rstudioapi_0.9.0 yaml_2.2.0 xfun_0.4 ## [19] withr_2.1.2 httr_1.4.0 stringr_1.4.0 ## [22] xml2_1.2.0 hms_0.4.2 webshot_0.5.1 ## [25] grid_3.5.2 tidyselect_0.2.5 glue_1.3.0 ## [28] R6_2.4.0 fansi_0.4.0 rmarkdown_1.11 ## [31] readr_1.3.1 purrr_0.3.0 magrittr_1.5 ## [34] codetools_0.2-15 scales_1.0.0 htmltools_0.3.6 ## [37] assertthat_0.2.0 rvest_0.3.2 colorspace_1.4-0 ## [40] utf8_1.1.4 stringi_1.3.1 lazyeval_0.2.1 ## [43] munsell_0.5.0 crayon_1.3.4

A work by Guillaume Abgrall

guillaume.abgrall@hotmail.fr

Joining Data in R with dplyr (2024)

References

Top Articles
Latest Posts
Article information

Author: Moshe Kshlerin

Last Updated:

Views: 6296

Rating: 4.7 / 5 (77 voted)

Reviews: 92% of readers found this page helpful

Author information

Name: Moshe Kshlerin

Birthday: 1994-01-25

Address: Suite 609 315 Lupita Unions, Ronnieburgh, MI 62697

Phone: +2424755286529

Job: District Education Designer

Hobby: Yoga, Gunsmithing, Singing, 3D printing, Nordic skating, Soapmaking, Juggling

Introduction: My name is Moshe Kshlerin, I am a gleaming, attractive, outstanding, pleasant, delightful, outstanding, famous person who loves writing and wants to share my knowledge and understanding with you.