Combining two datasets - Data literacy with Python

Dataset 2 - Accessing and Cleaning

The ABS produces The Socio-Economic Indexes for Areas (SEIFA) based on the Census of Population and Housing. These indexes provide a measure of relative socio-economic advantage and disadvantage across different areas of Australia. It includes this dataset, which includes from Table 5 an Index of Education and Occupation, which reflects the education and occupational level per postcode. The dataset is also published under a suitable Creative Commons Licence.

Looking at the spreadsheet, the data we are interested in are the Postcode and Percentile/Ranking within Australia columns of the Table 5 sheet. The data starts on row 7. We can import this data directly and then discard all but columns 0 and 6 (Postcode and Percentile/Rank).

seifa2021_url = 'https://www.abs.gov.au/statistics/people/people-and-communities/socio-economic-indexes-areas-seifa-australia/2021/Postal%20Area%2C%20Indexes%2C%20SEIFA%202021.xlsx'

seifa2021_raw = pd.read_excel(seifa2021_url , sheet_name='Table 5', skiprows=5, nrows=2627, usecols=[0,6], dtype=object)

seifa2021_raw.columns = ['Postcode','ieo_percentile']

seifa2021_raw["ieo_percentile"] = pd.to_numeric(seifa2021_raw["ieo_percentile"]).astype('Int64')

seifa2021_raw.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2627 entries, 0 to 2626
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Postcode        2627 non-null   object
 1   ieo_percentile  2627 non-null   Int64 
dtypes: Int64(1), object(1)
memory usage: 43.7+ KB

Combining datasets

Combining datasets offers great opportunities for data analysis and insights.

Our datasets both have postcode columns, so can be combined to compare education levels with income and private health status data.

Python/Pandas offer simple merge functions which can combine two datasets, based on common column(s) in both datasets.

We will use the merge method. We only have one column in common; if there are more than one columns in common, they would simply need adding to the by parameter as a list.

Finally a note about cleaning - there may be differences in the postcode allocations between the SEIFA and Taxation data, we will leave this discrepancy for now and see the impact later in this workflow.

(tax2022_raw.query('~State.isin(["Unknown","Overseas"])')
            .assign(TaxableIncome_dollarspr = lambda x: x.TaxableIncome_dollars/x.Returns)
            .assign(PrivateHealth_percentpp = lambda x: x.PrivateHealth_returns/x.Returns*100)
            .merge(seifa2021_raw, how="inner", on="Postcode")
            .loc[:, ["State", "Postcode", "ieo_percentile", "TaxableIncome_dollarspr", "PrivateHealth_percentpp"]]
            .query('Postcode=="6102"'))

	State	Postcode	ieo_percentile	TaxableIncome_dollarspr	PrivateHealth_percentpp
2287	WA	6102	69	56469.425226	48.54591

A note about SQL

For those with a background in SQL, R interfaces with many different databases and queries can be executed as if using the native SQL interface.

In the example below, the package RSQLite is used to

create a SQLite database,

conn = sqlite3.connect('tax_seifa.db', isolation_level = None)

refresh our memories of the tax2022_raw data

tax2022_raw.head(5)

	State	Postcode	Returns	TaxableIncome_dollars	PrivateHealth_returns
0	ACT	2600	5951	791214764	4841
1	ACT	2601	3614	265604097	1965
2	ACT	2602	23085	2026942835	15791
3	ACT	2603	7558	1055186744	5926
4	ACT	2604	9137	953261746	6649

refresh our memories of the seifa2021_raw data

seifa2021_raw.head(5)

	Postcode	ieo_percentile
0	0800	88
1	0810	79
2	0812	54
3	0820	85
4	0822	10

write the tax data from the above merge() method to a table,
write the seifa data from the above merge() method to a table,

seifa2021_raw.to_sql("seifa2021_tbl",conn, if_exists='replace', method='multi')
tax2022_raw.to_sql("tax2022_tbl",conn, if_exists='replace', method='multi')

execute a query using SELECT on the two tables to achieve the same output as the above merge() command.

pd.read_sql_query('SELECT seifa2021_tbl.ieo_percentile, tax2022_tbl.* \
                   FROM seifa2021_tbl \
                   JOIN tax2022_tbl ON seifa2021_tbl."Postcode" = tax2022_tbl."Postcode" \
                   WHERE seifa2021_tbl."Postcode" = 6102', conn)

	ieo_percentile	index	State	Postcode	Returns	TaxableIncome_dollars	PrivateHealth_returns
0	69	2355	WA	6102	9181	518445793	4457

and finally close the connection to the database.

conn.close()