Curtin logo

Introduction to R - Data Literacy

Version 1.8 - July 2024

COPYRIGHT © Curtin University 2024

Workshop 1 - Part 4 - Combining two datasets

Dataset 2 - Accessing and Cleaning

The ABS produces The Socio-Economic Indexes for Areas (SEIFA) based on the Census of Population and Housing. These indexes provide a measure of relative socio-economic advantage and disadvantage across different areas of Australia. It includes this dataset, which includes from Table 5 an Index of Education and Occupation, which reflects the education and occupational level per postcode. The dataset is also published under a suitable Creative Commons Licence.

Looking at the spreadsheet, the data we are interested in are the Postcode and Percentile/Ranking within Australia columns of the Table 5 sheet. The data starts on row 7. We can import this data direct using the readxl library. With this library we import all columns then discard all but 1 and 7 (Postcode and Percentile/Rank).

The import warns about some cell values however these columns are discarded so not an issue for us.

seifa2016_url <- 'https://www.ausstats.abs.gov.au/ausstats/subscriber.nsf/0/DC124D1DAC3D9FDDCA25825D000F9267/$File/2033055001%20-%20poa%20indexes.xls'

download.file(seifa2016_url, 'poa_indexes.xls', mode = 'wb')

seifa2016_raw <- read_excel('poa_indexes.xls', sheet='Table 5', range=cell_rows(7:2636), col_names=FALSE, na = ' ', .name_repair = 'minimal')[,c(1,7)]

names(seifa2016_raw) <- c('Postcode','ieo_percentile')

seifa2016_raw$Postcode <- as.numeric(seifa2016_raw$Postcode)

head(seifa2016_raw)
## # A tibble: 6 × 2
##   Postcode ieo_percentile
##      <dbl>          <dbl>
## 1     2000             94
## 2     2007             93
## 3     2008             99
## 4     2009             96
## 5     2010             97
## 6     2011             98

Combining datasets

Combining datasets offers great opportunities for data analysis and insights.

Our datasets both have postcode columns, so can be combined to compare education levels with income and private health status data.

R and the Tidyverse offer simple merge functions which can combine two datasets, based on common column(s) in both datasets.

We will use the inner_join() function. We only have one column in common; if there are more than one columns in common, they would simply need adding to the by parameter as a vector using the combine c() function.

Finally a note about cleaning - there may have been changes in the postcode allocations between the SEIFA data of 2016 and the Taxation data from 2019/20, we will leave this discrepancy for now and see the impact later in this workflow. (Hint: there are at least 10 changes!)

# Change variable type for Postcode column as excel import chooses it to be 'character' type
tax2020_raw$Postcode <- as.numeric(tax2020_raw$Postcode)

tax2020_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  mutate(TaxableIncome_dollarspr = TaxableIncome_dollars/Returns) %>% 
  mutate(PrivateHealth_percentpp = round(PrivateHealth_returns/Returns*100,0)) %>% 
  inner_join( x= ., y = seifa2016_raw, by = "Postcode") %>%
  select(State, Postcode, ieo_percentile, TaxableIncome_dollarspr, PrivateHealth_percentpp ) %>%
  filter(Postcode == '6102')
## # A tibble: 1 × 5
##   State Postcode ieo_percentile TaxableIncome_dollarspr PrivateHealth_percentpp
##   <chr>    <dbl>          <dbl>                   <dbl>                   <dbl>
## 1 WA        6102             65                  48324.                      48

A note about SQL

For those with a background in SQL, R interfaces with many different databases and queries can be executed as if using the native SQL interface.

In the example below, the library RSQLite is used to

  • create a SQLite database,
  • write the tax data from the above inner_join command to a table,
  • write the seifa data from the above inner_join command to a table,
  • execute a query using SELECT on the two tables to achieve the same output as the above inner_join() command.
tax_seifa_db <- dbConnect(RSQLite::SQLite(), "")
head(tax2020_raw)
## # A tibble: 6 × 5
##   State Postcode Returns TaxableIncome_dollars PrivateHealth_returns
##   <chr>    <dbl>   <dbl>                 <dbl>                 <dbl>
## 1 ACT       2600    5945             710218557                  4822
## 2 ACT       2601    3159             214621509                  1670
## 3 ACT       2602   22009            1747368144                 14747
## 4 ACT       2603    7165             906618402                  5617
## 5 ACT       2604    8617             821838536                  6118
## 6 ACT       2605    7973             748963372                  6126
head(seifa2016_raw)
## # A tibble: 6 × 2
##   Postcode ieo_percentile
##      <dbl>          <dbl>
## 1     2000             94
## 2     2007             93
## 3     2008             99
## 4     2009             96
## 5     2010             97
## 6     2011             98
dbWriteTable(tax_seifa_db,"seifa2016_tbl",seifa2016_raw)
dbWriteTable(tax_seifa_db,"tax2020_tbl",tax2020_raw)
dbListTables(tax_seifa_db)
## [1] "seifa2016_tbl" "tax2020_tbl"
dbGetQuery(tax_seifa_db, 'SELECT seifa2016_tbl.ieo_percentile, tax2020_tbl.* FROM seifa2016_tbl JOIN tax2020_tbl ON seifa2016_tbl."Postcode" = tax2020_tbl."Postcode" WHERE seifa2016_tbl."Postcode" = 6102')
##   ieo_percentile State Postcode Returns TaxableIncome_dollars
## 1             65    WA     6102    7899             381711165
##   PrivateHealth_returns
## 1                  3814
dbDisconnect(tax_seifa_db)

Previous

Part 3 - Data analysis and visualisation

Next

Part 5 - Map visualisations

Curtin logo

Version 1.8 - July 2024

COPYRIGHT © Curtin University 2024