How to merge data in R using R merge, dplyr, or data.table

0
116


R has a number of quick, elegant ways to join data frames by a common column. I’d like to show you three of them:

  • base R’s merge() function,
  • dplyr’s join family of functions, and
  • data.table’s bracket syntax.

Get and import the data

For this example I’ll use one of my favorite demo data sets — flight delay times from the U.S. Bureau of Transportation Statistics. If you want to follow along, head to and download data for the time frame of your choice with the columns Flight Date, Reporting_Airline, Origin, Destination, and DepartureDelayMinutes. Also get the lookup table for Reporting_Airline.

Or, download these two data sets — plus my R code in a single file and a PowerPoint explaining different types of data merges — here:

Includes several data files, a PowerPoint, and R script to accompany InfoWorld article. Sharon Machlis

 

To read in the file with base R, I’d first unzip the flight delay file and then import both flight delay data and the code lookup file with read.csv(). If you’re running the code, the delay file you downloaded will likely have a different name than in the code below. Also, note the lookup file’s unusual .csv_ extension.

unzip("673598238_T_ONTIME_REPORTING.zip")
mydf <- read.csv("673598238_T_ONTIME_REPORTING.csv",
sep = ",", quote=""")
mylookup <- read.csv("L_UNIQUE_CARRIERS.csv_",
quote=""", sep = "," )

Next, I’ll take a peek at both files with head():

head(mydf)
     FL_DATE OP_UNIQUE_CARRIER ORIGIN DEST DEP_DELAY_NEW  X
1 2019-08-01                DL    ATL  DFW            31 NA
2 2019-08-01                DL    DFW  ATL             0 NA
3 2019-08-01                DL    IAH  ATL            40 NA
4 2019-08-01                DL    PDX  SLC             0 NA
5 2019-08-01                DL    SLC  PDX             0 NA
6 2019-08-01                DL    DTW  ATL            10 NA

head(mylookup) Code Description 1 02Q Titan Airways 2 04Q Tradewind Aviation 3 05Q Comlux Aviation, AG 4 06Q Master Top Linhas Aereas Ltd. 5 07Q Flair Airlines Ltd. 6 09Q Swift Air, LLC d/b/a Eastern Air Lines d/b/a Eastern

Merges with base R

The mydf delay data frame only has airline information by code. I’d like to add a column with the airline names from mylookup. One base R way to do this is with the merge() function, using the basic syntax merge(df1, df2). It doesn’t matter the order of data frame 1 and data frame 2, but whichever one is first is considered x and the second one is y. 

IDG

A left join keeps all rows in the left data frame and only matching rows from the right data frame.

The code to import and merge both data sets using left_join() is below. It starts by loading the dplyr and readr packages and then reads in the two files with read_csv(). When using read_csv(), I don’t need to unzip the file first.

library(dplyr)
library(readr)
mytibble <- read_csv("673598238_T_ONTIME_REPORTING.zip")
mylookup_tibble <- read_csv("L_UNIQUE_CARRIERS.csv_")
joined_tibble <- left_join(mytibble, mylookup_tibble, 
by = c("OP_UNIQUE_CARRIER" = "Code"))

read_csv() creates tibbles, which are a type of data frame with some extra features. left_join() merges the two. Take a look at the syntax: In this case, order matters. left_join() means include all rows on the left, or first, data set, but only rows that match from the second one. And, because I need to join by two differently named columns, I included a by argument.

We can look at the structure of the result with dplyr’s glimpse() function, which is another way to see the top few items of a data frame.

glimpse(joined_tibble)
Observations: 658,461
Variables: 7
$ FL_DATE           <date> 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01…
$ OP_UNIQUE_CARRIER <chr> "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL",…
$ ORIGIN            <chr> "ATL", "DFW", "IAH", "PDX", "SLC", "DTW", "ATL", "MSP", "JF…
$ DEST              <chr> "DFW", "ATL", "ATL", "SLC", "PDX", "ATL", "DTW", "JFK", "MS…
$ DEP_DELAY_NEW     <dbl> 31, 0, 40, 0, 0, 10, 0, 22, 0, 0, 0, 17, 5, 2, 0, 0, 8, 0, …
$ X6                <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Description       <chr> "Delta Air Lines Inc.", "Delta Air Lines Inc.", "Delta Air …

This joined data set now has a new column with the name of the airline. If you run a version of this code yourself, you’ll probably notice that dplyr was way faster than base R.

Next, let’s look at a super-fast way to do joins.