Dataset 2 - Accessing and Cleaning

The ABS produces The Socio-Economic Indexes for Areas (SEIFA) based on the Census of Population and Housing. These indexes provide a measure of relative socio-economic advantage and disadvantage across different areas of Australia. It includes this dataset, which includes from Table 5 an Index of Education and Occupation, which reflects the education and occupational level per postcode. The dataset is also published under a suitable Creative Commons Licence.

Looking at the spreadsheet, the data we are interested in are the Postcode and Percentile/Ranking within Australia columns of the Table 5 sheet. The data starts on row 7. We can import this data directly and then discard all but columns 1 and 7 (Postcode and Percentile/Rank).

seifa2021_url <- 'https://www.abs.gov.au/statistics/people/people-and-communities/socio-economic-indexes-areas-seifa-australia/2021/Postal%20Area%2C%20Indexes%2C%20SEIFA%202021.xlsx'

download.file(seifa2021_url, 'poa_indexes.xlsx', mode = 'wb')

seifa2021_raw <- read_excel('poa_indexes.xlsx', sheet='Table 5', skip=6, n_max=2627, col_names=FALSE, .name_repair = 'minimal')[,c(1,7)]

names(seifa2021_raw) <- c('Postcode','ieo_percentile')

str(seifa2021_raw)

tibble [2,627 × 2] (S3: tbl_df/tbl/data.frame)
 $ Postcode      : chr [1:2627] "0800" "0810" "0812" "0820" ...
 $ ieo_percentile: num [1:2627] 88 79 54 85 10 57 2 40 61 57 ...

Combining datasets

Combining datasets offers great opportunities for data analysis and insights.

Our datasets both have postcode columns, so can be combined to compare education levels with income and private health status data.

R and the Tidyverse offer simple merge functions which can combine two datasets, based on common column(s) in both datasets.

We will use the inner_join() function. We only have one column in common; if there are more than one columns in common, they would simply need adding to the by parameter as a vector using the combine c() function.

Finally a note about cleaning - there may be differences in the postcode allocations between the SEIFA and Taxation data, we will leave this discrepancy for now and see the impact later in this workflow.

tax2022_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  mutate(TaxableIncome_dollarspr = TaxableIncome_dollars/Returns) %>% 
  mutate(PrivateHealth_percentpp = PrivateHealth_returns/Returns*100) %>% 
  inner_join( x= ., y = seifa2021_raw, by = "Postcode") %>%
  select(State, Postcode, ieo_percentile, TaxableIncome_dollarspr, PrivateHealth_percentpp ) %>%
  filter(Postcode == '6102')
A tibble: 1 × 5
State Postcode ieo_percentile TaxableIncome_dollarspr PrivateHealth_percentpp
<chr> <chr> <dbl> <dbl> <dbl>
WA 6102 69 56469.43 48.54591

A note about SQL

For those with a background in SQL, R interfaces with many different databases and queries can be executed as if using the native SQL interface.

In the example below, the library RSQLite is used to

  • create a SQLite database,
tax_seifa_db <- dbConnect(RSQLite::SQLite(), "")
  • refresh our memories of the tax2022_raw data
head(tax2022_raw)
A tibble: 6 × 5
State Postcode Returns TaxableIncome_dollars PrivateHealth_returns
<chr> <chr> <dbl> <dbl> <dbl>
ACT 2600 5951 791214764 4841
ACT 2601 3614 265604097 1965
ACT 2602 23085 2026942835 15791
ACT 2603 7558 1055186744 5926
ACT 2604 9137 953261746 6649
ACT 2605 8111 863964219 6298
  • refresh our memories of the seifa2021_raw data
head(seifa2021_raw)
A tibble: 6 × 2
Postcode ieo_percentile
<chr> <dbl>
0800 88
0810 79
0812 54
0820 85
0822 10
0828 57
  • write the tax data from the above inner_join command to a table,
  • write the seifa data from the above inner_join command to a table,
dbWriteTable(tax_seifa_db,"seifa2021_tbl",seifa2021_raw)
dbWriteTable(tax_seifa_db,"tax2022_tbl",tax2022_raw)
dbListTables(tax_seifa_db)
  1. 'seifa2021_tbl'
  2. 'tax2022_tbl'
  • execute a query using SELECT on the two tables to achieve the same output as the above inner_join() command,
dbGetQuery(tax_seifa_db, 'SELECT seifa2021_tbl.ieo_percentile, tax2022_tbl.* FROM seifa2021_tbl JOIN tax2022_tbl ON seifa2021_tbl."Postcode" = tax2022_tbl."Postcode" WHERE seifa2021_tbl."Postcode" = 6102')
A data.frame: 1 × 6
ieo_percentile State Postcode Returns TaxableIncome_dollars PrivateHealth_returns
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
69 WA 6102 9181 518445793 4457
  • and finally close the connection to the database.
dbDisconnect(tax_seifa_db)