Combining two datasets - Data literacy with R

Dataset 2 - Accessing and Cleaning

The ABS produces The Socio-Economic Indexes for Areas (SEIFA) based on the Census of Population and Housing. These indexes provide a measure of relative socio-economic advantage and disadvantage across different areas of Australia. It includes this dataset, which includes from Table 5 an Index of Education and Occupation, which reflects the education and occupational level per postcode. The dataset is also published under a suitable Creative Commons Licence.

Looking at the spreadsheet, the data we are interested in are the Postcode and Percentile/Ranking within Australia columns of the Table 5 sheet. The data starts on row 7. We can import this data directly and then discard all but columns 1 and 7 (Postcode and Percentile/Rank).

seifa2021_url <- 'https://www.abs.gov.au/statistics/people/people-and-communities/socio-economic-indexes-areas-seifa-australia/2021/Postal%20Area%2C%20Indexes%2C%20SEIFA%202021.xlsx'

download.file(seifa2021_url, 'poa_indexes.xlsx', mode = 'wb')

seifa2021_raw <- read_excel('poa_indexes.xlsx', sheet='Table 5', skip=6, n_max=2627, col_names=FALSE, .name_repair = 'minimal')[,c(1,7)]

names(seifa2021_raw) <- c('Postcode','ieo_percentile')

str(seifa2021_raw)


tibble [2,627 × 2] (S3: tbl_df/tbl/data.frame)
 $ Postcode      : chr [1:2627] "0800" "0810" "0812" "0820" ...
 $ ieo_percentile: num [1:2627] 88 79 54 85 10 57 2 40 61 57 ...

Combining datasets

Combining datasets offers great opportunities for data analysis and insights.

Our datasets both have postcode columns, so can be combined to compare education levels with income and private health status data.

R and the Tidyverse offer simple merge functions which can combine two datasets, based on common column(s) in both datasets.

We will use the inner_join() function. We only have one column in common; if there are more than one columns in common, they would simply need adding to the by parameter as a vector using the combine c() function.

Finally a note about cleaning - there may be differences in the postcode allocations between the SEIFA and Taxation data, we will leave this discrepancy for now and see the impact later in this workflow.

tax2022_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  mutate(TaxableIncome_dollarspr = TaxableIncome_dollars/Returns) %>% 
  mutate(PrivateHealth_percentpp = PrivateHealth_returns/Returns*100) %>% 
  inner_join( x= ., y = seifa2021_raw, by = "Postcode") %>%
  select(State, Postcode, ieo_percentile, TaxableIncome_dollarspr, PrivateHealth_percentpp ) %>%
  filter(Postcode == '6102')

A tibble: 1 × 5
State	Postcode	ieo_percentile	TaxableIncome_dollarspr	PrivateHealth_percentpp
<chr>	<chr>	<dbl>	<dbl>	<dbl>
WA	6102	69	56469.43	48.54591

A note about SQL

For those with a background in SQL, R interfaces with many different databases and queries can be executed as if using the native SQL interface.

In the example below, the library RSQLite is used to

create a SQLite database,

tax_seifa_db <- dbConnect(RSQLite::SQLite(), "")

refresh our memories of the tax2022_raw data

head(tax2022_raw)

A tibble: 6 × 5
State	Postcode	Returns	TaxableIncome_dollars	PrivateHealth_returns
<chr>	<chr>	<dbl>	<dbl>	<dbl>
ACT	2600	5951	791214764	4841
ACT	2601	3614	265604097	1965
ACT	2602	23085	2026942835	15791
ACT	2603	7558	1055186744	5926
ACT	2604	9137	953261746	6649
ACT	2605	8111	863964219	6298

refresh our memories of the seifa2021_raw data

head(seifa2021_raw)

A tibble: 6 × 2
Postcode	ieo_percentile
<chr>	<dbl>
0800	88
0810	79
0812	54
0820	85
0822	10
0828	57

write the tax data from the above inner_join command to a table,
write the seifa data from the above inner_join command to a table,

dbWriteTable(tax_seifa_db,"seifa2021_tbl",seifa2021_raw)
dbWriteTable(tax_seifa_db,"tax2022_tbl",tax2022_raw)
dbListTables(tax_seifa_db)

'seifa2021_tbl'
'tax2022_tbl'

execute a query using SELECT on the two tables to achieve the same output as the above inner_join() command,

dbGetQuery(tax_seifa_db, 'SELECT seifa2021_tbl.ieo_percentile, tax2022_tbl.* FROM seifa2021_tbl JOIN tax2022_tbl ON seifa2021_tbl."Postcode" = tax2022_tbl."Postcode" WHERE seifa2021_tbl."Postcode" = 6102')

A data.frame: 1 × 6
ieo_percentile	State	Postcode	Returns	TaxableIncome_dollars	PrivateHealth_returns
<dbl>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
69	WA	6102	9181	518445793	4457

and finally close the connection to the database.

dbDisconnect(tax_seifa_db)