The ABS produces The Socio-Economic Indexes for Areas (SEIFA) based on the Census of Population and Housing. These indexes provide a measure of relative socio-economic advantage and disadvantage across different areas of Australia. It includes this dataset, which includes from Table 5 an Index of Education and Occupation, which reflects the education and occupational level per postcode. The dataset is also published under a suitable Creative Commons Licence.
Looking at the spreadsheet, the data we are interested in are the Postcode and Percentile/Ranking within Australia columns of the Table 5 sheet. The data starts on row 7. We can import this data directly and then discard all but columns 1 and 7 (Postcode and Percentile/Rank).
seifa2021_url <- 'https://www.abs.gov.au/statistics/people/people-and-communities/socio-economic-indexes-areas-seifa-australia/2021/Postal%20Area%2C%20Indexes%2C%20SEIFA%202021.xlsx'
download.file(seifa2021_url, 'poa_indexes.xlsx', mode = 'wb')
seifa2021_raw <- read_excel('poa_indexes.xlsx', sheet='Table 5', skip=6, n_max=2627, col_names=FALSE, .name_repair = 'minimal')[,c(1,7)]
names(seifa2021_raw) <- c('Postcode','ieo_percentile')
str(seifa2021_raw)
tibble [2,627 × 2] (S3: tbl_df/tbl/data.frame)
$ Postcode : chr [1:2627] "0800" "0810" "0812" "0820" ...
$ ieo_percentile: num [1:2627] 88 79 54 85 10 57 2 40 61 57 ...
Combining datasets offers great opportunities for data analysis and insights.
Our datasets both have postcode columns, so can be combined to compare education levels with income and private health status data.
R and the Tidyverse offer simple merge functions which can combine two datasets, based on common column(s) in both datasets.
We will use the inner_join() function. We only have one column in common; if there are more than one columns in common, they would simply need adding to the by parameter as a vector using the combine c() function.
Finally a note about cleaning - there may be differences in the postcode allocations between the SEIFA and Taxation data, we will leave this discrepancy for now and see the impact later in this workflow.
tax2022_raw %>%
filter( State !="Unknown" & State!="Overseas" ) %>%
mutate(TaxableIncome_dollarspr = TaxableIncome_dollars/Returns) %>%
mutate(PrivateHealth_percentpp = PrivateHealth_returns/Returns*100) %>%
inner_join( x= ., y = seifa2021_raw, by = "Postcode") %>%
select(State, Postcode, ieo_percentile, TaxableIncome_dollarspr, PrivateHealth_percentpp ) %>%
filter(Postcode == '6102')
State | Postcode | ieo_percentile | TaxableIncome_dollarspr | PrivateHealth_percentpp |
---|---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> | <dbl> |
WA | 6102 | 69 | 56469.43 | 48.54591 |
For those with a background in SQL, R interfaces with many different databases and queries can be executed as if using the native SQL interface.
In the example below, the library RSQLite is used to
tax_seifa_db <- dbConnect(RSQLite::SQLite(), "")
head(tax2022_raw)
State | Postcode | Returns | TaxableIncome_dollars | PrivateHealth_returns |
---|---|---|---|---|
<chr> | <chr> | <dbl> | <dbl> | <dbl> |
ACT | 2600 | 5951 | 791214764 | 4841 |
ACT | 2601 | 3614 | 265604097 | 1965 |
ACT | 2602 | 23085 | 2026942835 | 15791 |
ACT | 2603 | 7558 | 1055186744 | 5926 |
ACT | 2604 | 9137 | 953261746 | 6649 |
ACT | 2605 | 8111 | 863964219 | 6298 |
head(seifa2021_raw)
Postcode | ieo_percentile |
---|---|
<chr> | <dbl> |
0800 | 88 |
0810 | 79 |
0812 | 54 |
0820 | 85 |
0822 | 10 |
0828 | 57 |
dbWriteTable(tax_seifa_db,"seifa2021_tbl",seifa2021_raw)
dbWriteTable(tax_seifa_db,"tax2022_tbl",tax2022_raw)
dbListTables(tax_seifa_db)
dbGetQuery(tax_seifa_db, 'SELECT seifa2021_tbl.ieo_percentile, tax2022_tbl.* FROM seifa2021_tbl JOIN tax2022_tbl ON seifa2021_tbl."Postcode" = tax2022_tbl."Postcode" WHERE seifa2021_tbl."Postcode" = 6102')
ieo_percentile | State | Postcode | Returns | TaxableIncome_dollars | PrivateHealth_returns |
---|---|---|---|---|---|
<dbl> | <chr> | <chr> | <dbl> | <dbl> | <dbl> |
69 | WA | 6102 | 9181 | 518445793 | 4457 |
dbDisconnect(tax_seifa_db)