Dataset 2 - Accessing and Cleaning

The ABS produces The Socio-Economic Indexes for Areas (SEIFA) based on the Census of Population and Housing. These indexes provide a measure of relative socio-economic advantage and disadvantage across different areas of Australia. It includes this dataset, which includes from Table 5 an Index of Education and Occupation, which reflects the education and occupational level per postcode. The dataset is also published under a suitable Creative Commons Licence.

Looking at the spreadsheet, the data we are interested in are the Postcode and Percentile/Ranking within Australia columns of the Table 5 sheet. The data starts on row 7. We can import this data directly and then discard all but columns 0 and 6 (Postcode and Percentile/Rank).

seifa2021_url = 'https://www.abs.gov.au/statistics/people/people-and-communities/socio-economic-indexes-areas-seifa-australia/2021/Postal%20Area%2C%20Indexes%2C%20SEIFA%202021.xlsx'

seifa2021_raw = pd.read_excel(seifa2021_url , sheet_name='Table 5', skiprows=5, nrows=2627, usecols=[0,6], dtype=object)

seifa2021_raw.columns = ['Postcode','ieo_percentile']

seifa2021_raw["ieo_percentile"] = pd.to_numeric(seifa2021_raw["ieo_percentile"]).astype('Int64')

seifa2021_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2627 entries, 0 to 2626
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Postcode        2627 non-null   object
 1   ieo_percentile  2627 non-null   Int64 
dtypes: Int64(1), object(1)
memory usage: 43.7+ KB

Combining datasets

Combining datasets offers great opportunities for data analysis and insights.

Our datasets both have postcode columns, so can be combined to compare education levels with income and private health status data.

Python/Pandas offer simple merge functions which can combine two datasets, based on common column(s) in both datasets.

We will use the merge method. We only have one column in common; if there are more than one columns in common, they would simply need adding to the by parameter as a list.

Finally a note about cleaning - there may be differences in the postcode allocations between the SEIFA and Taxation data, we will leave this discrepancy for now and see the impact later in this workflow.

(tax2022_raw.query('~State.isin(["Unknown","Overseas"])')
            .assign(TaxableIncome_dollarspr = lambda x: x.TaxableIncome_dollars/x.Returns)
            .assign(PrivateHealth_percentpp = lambda x: x.PrivateHealth_returns/x.Returns*100)
            .merge(seifa2021_raw, how="inner", on="Postcode")
            .loc[:, ["State", "Postcode", "ieo_percentile", "TaxableIncome_dollarspr", "PrivateHealth_percentpp"]]
            .query('Postcode=="6102"'))
State Postcode ieo_percentile TaxableIncome_dollarspr PrivateHealth_percentpp
2287 WA 6102 69 56469.425226 48.54591

A note about SQL

For those with a background in SQL, R interfaces with many different databases and queries can be executed as if using the native SQL interface.

In the example below, the package RSQLite is used to

  • create a SQLite database,
conn = sqlite3.connect('tax_seifa.db', isolation_level = None)
  • refresh our memories of the tax2022_raw data
tax2022_raw.head(5)
State Postcode Returns TaxableIncome_dollars PrivateHealth_returns
0 ACT 2600 5951 791214764 4841
1 ACT 2601 3614 265604097 1965
2 ACT 2602 23085 2026942835 15791
3 ACT 2603 7558 1055186744 5926
4 ACT 2604 9137 953261746 6649
  • refresh our memories of the seifa2021_raw data
seifa2021_raw.head(5)
Postcode ieo_percentile
0 0800 88
1 0810 79
2 0812 54
3 0820 85
4 0822 10
  • write the tax data from the above merge() method to a table,
  • write the seifa data from the above merge() method to a table,
seifa2021_raw.to_sql("seifa2021_tbl",conn, if_exists='replace', method='multi')
tax2022_raw.to_sql("tax2022_tbl",conn, if_exists='replace', method='multi')
2639
  • execute a query using SELECT on the two tables to achieve the same output as the above merge() command.
pd.read_sql_query('SELECT seifa2021_tbl.ieo_percentile, tax2022_tbl.* \
                   FROM seifa2021_tbl \
                   JOIN tax2022_tbl ON seifa2021_tbl."Postcode" = tax2022_tbl."Postcode" \
                   WHERE seifa2021_tbl."Postcode" = 6102', conn)
ieo_percentile index State Postcode Returns TaxableIncome_dollars PrivateHealth_returns
0 69 2355 WA 6102 9181 518445793 4457

and finally close the connection to the database.

conn.close()