Curtin logo

Introduction to R - Data Literacy

Version 1.8 - July 2024

COPYRIGHT © Curtin University 2024

Workshop 1 - Part 3 - Real world data analysis and visualisation

Real World Datasets

There are many sources of data on the internet. Governments make public sector data available for activities such as Hackathons, allowing diverse groups of people to provide innovative solutions for communities.

For example, the West Australian State Government has the site Data WA, with over 2000 datasets.

The Australian Government has the site data.gov.au, with over 100,000 datasets.

Another good source is the Australian Bureau of Statistics (ABS)

Dataset 1 - Accessing and Cleaning

Our first dataset is Australian Taxation Statistics 2019-2020, in particular Table 6B which gives summary tax details for individual returns by postcode.

This dataset can be accessed directly from R, it is an Excel xlsx file. It is published under a Creative Commons Attribution 2.5 Australia licence so is suitable for use here.

Previewing this file, the data really starts in row 2 with the column names. Let’s say we are only interested in some data, somewhat arbitrarily Taxable Income and Private Health cover status related data across States and Postcodes. So we only need to import Columns 1, 2, 3, 5 and 152.

tax2020_url <- 'https://data.gov.au/data/dataset/5fa69f19-ec44-4c46-88eb-0f6fd5c2f43b/resource/d2eb3863-78c6-4afe-a348-83043df5aeab/download/ts20individual06taxablestatusstateterritorypostcode.xlsx'

download.file(tax2020_url, 'tax2020.xlsx', mode = 'wb')

tax2020_raw <- read_excel('tax2020.xlsx', sheet = 'Table 6B', skip = 1, col_names = TRUE)[ ,c(1,2,3,5,152)]

head(tax2020_raw, 5)
## # A tibble: 5 × 5
##   `State/ Territory1` Postcode `Individuals\r\nno.` Taxable income or loss3\r\…¹
##   <chr>               <chr>                   <dbl>                        <dbl>
## 1 ACT                 2600                     5945                    710218557
## 2 ACT                 2601                     3159                    214621509
## 3 ACT                 2602                    22009                   1747368144
## 4 ACT                 2603                     7165                    906618402
## 5 ACT                 2604                     8617                    821838536
## # ℹ abbreviated name: ¹​`Taxable income or loss3\r\n$`
## # ℹ 1 more variable: `People with private health insurance\r\nno.` <dbl>

R provides the functionality to change the column names to something easier to work with.

names(tax2020_raw) <- c('State', 'Postcode', 'Returns', 'TaxableIncome_dollars', 'PrivateHealth_returns')
head(tax2020_raw)
## # A tibble: 6 × 5
##   State Postcode Returns TaxableIncome_dollars PrivateHealth_returns
##   <chr> <chr>      <dbl>                 <dbl>                 <dbl>
## 1 ACT   2600        5945             710218557                  4822
## 2 ACT   2601        3159             214621509                  1670
## 3 ACT   2602       22009            1747368144                 14747
## 4 ACT   2603        7165             906618402                  5617
## 5 ACT   2604        8617             821838536                  6118
## 6 ACT   2605        7973             748963372                  6126

Using only code, R provides the ability to look at the structure (str()) and summary (sum()) of the data. Other useful functions are nrows() and ncols().

# Structure
str(tax2020_raw)
## tibble [2,663 × 5] (S3: tbl_df/tbl/data.frame)
##  $ State                : chr [1:2663] "ACT" "ACT" "ACT" "ACT" ...
##  $ Postcode             : chr [1:2663] "2600" "2601" "2602" "2603" ...
##  $ Returns              : num [1:2663] 5945 3159 22009 7165 8617 ...
##  $ TaxableIncome_dollars: num [1:2663] 7.10e+08 2.15e+08 1.75e+09 9.07e+08 8.22e+08 ...
##  $ PrivateHealth_returns: num [1:2663] 4822 1670 14747 5617 6118 ...
# Summary
summary(tax2020_raw)
##     State             Postcode            Returns       TaxableIncome_dollars
##  Length:2663        Length:2663        Min.   :    50   Min.   : -44209909   
##  Class :character   Class :character   1st Qu.:   382   1st Qu.:  19325060   
##  Mode  :character   Mode  :character   Median :  1974   Median : 105405855   
##                                        Mean   :  5617   Mean   : 358836001   
##                                        3rd Qu.:  8243   3rd Qu.: 535115412   
##                                        Max.   :144907   Max.   :4246643114   
##  PrivateHealth_returns
##  Min.   :   12        
##  1st Qu.:  199        
##  Median :  993        
##  Mean   : 3124        
##  3rd Qu.: 4688        
##  Max.   :35364

To access rows, columns, or any combination of these, there are multiple ways of achieving this in R.

# Columns
head(tax2020_raw[ , 2], 3)
## # A tibble: 3 × 1
##   Postcode
##   <chr>   
## 1 2600    
## 2 2601    
## 3 2602
head(tax2020_raw[ , "Postcode"], 4)
## # A tibble: 4 × 1
##   Postcode
##   <chr>   
## 1 2600    
## 2 2601    
## 3 2602    
## 4 2603
tail(tax2020_raw$Postcode, 4)
## [1] "6979"     "6981"     "6985"     "WA other"
print(unique(tax2020_raw$State))
##  [1] "ACT"      "NSW"      "NT"       "Overseas" "QLD"      "SA"      
##  [7] "TAS"      "Unknown"  "VIC"      "WA"
max(tax2020_raw$PrivateHealth_returns)
## [1] 35364
# Rows
tax2020_raw[2, ]
## # A tibble: 1 × 5
##   State Postcode Returns TaxableIncome_dollars PrivateHealth_returns
##   <chr> <chr>      <dbl>                 <dbl>                 <dbl>
## 1 ACT   2601        3159             214621509                  1670
tax2020_raw[tax2020_raw$State %in% c("Unknown","Overseas"),]
## # A tibble: 2 × 5
##   State    Postcode Returns TaxableIncome_dollars PrivateHealth_returns
##   <chr>    <chr>      <dbl>                 <dbl>                 <dbl>
## 1 Overseas Overseas  144907            3467614939                 18894
## 2 Unknown  Unknown     1976              97251004                   873
tax2020_raw[tax2020_raw$Postcode == 6102, ]
## # A tibble: 1 × 5
##   State Postcode Returns TaxableIncome_dollars PrivateHealth_returns
##   <chr> <chr>      <dbl>                 <dbl>                 <dbl>
## 1 WA    6102        7899             381711165                  3814
# Rows and Columns
tax2020_raw[2,2]
## # A tibble: 1 × 1
##   Postcode
##   <chr>   
## 1 2601
tax2020_raw[tax2020_raw$PrivateHealth_returns == max(tax2020_raw$PrivateHealth_returns), c(1,2,5) ]     
## # A tibble: 1 × 3
##   State Postcode PrivateHealth_returns
##   <chr> <chr>                    <dbl>
## 1 QLD   4350                     35364

There are two interesting aspects of the data above which demonstrate the need to ‘clean’ data.

  • The tail command above reveals that the data is not exclusively ‘per postcode’; if the number of returns was small those postcodes are grouped into an ‘Other’ row. We will leave this for the moment and observe the impact later in this workflow.

  • There are some values for ‘Overseas’ and ‘Unknown’ which are not of interest, so in base R we would create a new dataset without these.

# Create an new, intermediate table without these rows
tax2020_raw_aus <- tax2020_raw[tax2020_raw$State !="Unknown" & tax2020_raw$State!="Overseas",]

# Check the rows are now excluded 
head(tax2020_raw_aus[tax2020_raw_aus$State %in% c("Unknown","Overseas"),])
## # A tibble: 0 × 5
## # ℹ 5 variables: State <chr>, Postcode <chr>, Returns <dbl>,
## #   TaxableIncome_dollars <dbl>, PrivateHealth_returns <dbl>

The Tidyverse!

In cleaning and subsetting the data above we now have two data frames, namely tax2020_raw and tax2020_raw_aus. In trying to keep the R commands relatively short and understandable we can end up with a lot of intermediate or temporary data frames, which can be difficult to keep track of. We could choose to not create the intermediate dataframes using a lot of ‘nesting’, though this leads to long complicated commands with lots of brackets! Another alternative is to use pipes, where the output of one command is ‘piped’ into the next command and so on. This strikes a great balance between command readability and minimising intermediate data frames.

There is an R package called Tidyverse, which includes a set of key R extension packages (including dplyr and tidyr) intended to make using (and learning) R easier for beginners. It includes the piping functionality, along with functions which filter, reshape and plot data. We will use the Tidyverse commands in the following analysis. In fact we already have, using read_excel above.

For example, to achieve removing the same rows in the previous step, we can ‘pipe’ the tax2020_raw data to the filter() command from the Tidyverse. The symbol for pipe is %>%, the keyboard shortcut for which is Ctrl+Shift+M, or Command+Shift+M on a Mac.

Note that the head command plays nicely with the piping too.

tax2020_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>%
  head(5)
## # A tibble: 5 × 5
##   State Postcode Returns TaxableIncome_dollars PrivateHealth_returns
##   <chr> <chr>      <dbl>                 <dbl>                 <dbl>
## 1 ACT   2600        5945             710218557                  4822
## 2 ACT   2601        3159             214621509                  1670
## 3 ACT   2602       22009            1747368144                 14747
## 4 ACT   2603        7165             906618402                  5617
## 5 ACT   2604        8617             821838536                  6118

Key Learning

Key Learning #1 - Data is data, there is no need to constantly have a dedicated view for the raw, original data. These tools allow us to view it in any form needed so as to inform our analysis and visualisation.

Key Learning #2 - Cleaning the data involves investigating the original data, leaving it as it is and writing code to create a workable dataset, having removed unnecessary or incorrect data.

Key Learning #3 - Using pipes makes it simpler to prepare, read and modify code and eliminates the need for the clutter of many intermediate or temporary data frames.

Further Learning

Further Learning #1 - The datasets in this workshop are quite ‘clean’ and complete. Then there are datasets which are incomplete with data that is not available or NA - for another time.

Dataset 1 - Analysis

As an example, let’s perform some aggregate functions, such as sums or totals of dollars and returns for each State. Whereas filter() acts on rows, select() acts on columns. We will aggregate by State, so we remove the Postcode column using select(). The !Postcode here is read as ‘select all columns that are not the Postcode column’.

To calculate the sums we can pipe the data to a group_by() command to group by State, and then pipe that result to a summarise_all() command to perform the aggregation on all columns. It’s also possible to use summarise() to sum individual columns.

# Totals by State
tax2020_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  select(!Postcode) %>% 
  group_by(State) %>% 
  summarise_all(sum)
## # A tibble: 8 × 4
##   State Returns TaxableIncome_dollars PrivateHealth_returns
##   <chr>   <dbl>                 <dbl>                 <dbl>
## 1 ACT    287586           21930535789                189069
## 2 NSW   4678175          315279570038               2718411
## 3 NT     130549            8630916786                 64608
## 4 QLD   2985460          180514395271               1520214
## 5 SA    1010814           57665385463                606316
## 6 TAS    312563           16843743756                162781
## 7 VIC   3823784          242808117837               1953642
## 8 WA    1582776          108342740689               1085169

To calculate the sums for the whole of Australia for 2019-2020, let’s filter the State column too.

# Totals for Australia
tax2020_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  select(!c('Postcode', 'State')) %>% 
  summarise_all(sum)
## # A tibble: 1 × 3
##    Returns TaxableIncome_dollars PrivateHealth_returns
##      <dbl>                 <dbl>                 <dbl>
## 1 14811707          952015405629               8300210

As a further example, let’s calculate the

  • Taxable Income per return per State, and
  • Percent of Private Health Insurance per State

Here we are creating two new calculated columns based on the data for each row, and so will use the mutate() command to create the new columns.

# Means per State
tax2020_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  mutate(TaxableIncome_dollarspr = TaxableIncome_dollars/Returns) %>% 
  mutate(PrivateHealth_percentpp = round(PrivateHealth_returns/Returns*100,0)) %>% 
  select(State, TaxableIncome_dollarspr, PrivateHealth_percentpp ) %>%
  group_by(State) %>% 
  summarise_all(mean)
## # A tibble: 8 × 3
##   State TaxableIncome_dollarspr PrivateHealth_percentpp
##   <chr>                   <dbl>                   <dbl>
## 1 ACT                    78085.                    64.2
## 2 NSW                    61297.                    59.7
## 3 NT                     67982.                    49.7
## 4 QLD                    54410.                    50.4
## 5 SA                     53403.                    58.7
## 6 TAS                    51653.                    51.1
## 7 VIC                    58096.                    49.1
## 8 WA                     64883.                    67.2

Wide and Narrow Formats

The raw data is in summary form, or wide form. Easily read by people, not ideal for all the processing options available for machines eg AI.

Sometimes tasks in R are more easily achieved with the data in narrow or long format, where each row essentially only has one item of data.

Fortunately R and the tidyverse have tools which allow for easily swapping between formats, namely pivot_wider and pivot_longer.

What is important is knowing which columns are to be kept, often called identifier variables ( Postcode and State ), and which columns are to be pivoted, often called measured variables ( Returns, Taxable Income and Private Health status ).

Let’s create a temporary dataframe tax2020_raw_long here just to show the effect of moving from wide to narrow/long formats and back again.

# From Wide to Narrow/Long
tax2020_raw_long <- tax2020_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  pivot_longer( cols = -c(Postcode, State), 
                names_to = "item", 
                values_to = "value", 
                values_drop_na = TRUE)
head(tax2020_raw_long)
## # A tibble: 6 × 4
##   State Postcode item                      value
##   <chr> <chr>    <chr>                     <dbl>
## 1 ACT   2600     Returns                    5945
## 2 ACT   2600     TaxableIncome_dollars 710218557
## 3 ACT   2600     PrivateHealth_returns      4822
## 4 ACT   2601     Returns                    3159
## 5 ACT   2601     TaxableIncome_dollars 214621509
## 6 ACT   2601     PrivateHealth_returns      1670
# and then back from Narrow/Long to Wide
tax2020_raw_long %>% 
  pivot_wider( names_from = "item", 
               values_from = "value") %>%
  head()
## # A tibble: 6 × 5
##   State Postcode Returns TaxableIncome_dollars PrivateHealth_returns
##   <chr> <chr>      <dbl>                 <dbl>                 <dbl>
## 1 ACT   2600        5945             710218557                  4822
## 2 ACT   2601        3159             214621509                  1670
## 3 ACT   2602       22009            1747368144                 14747
## 4 ACT   2603        7165             906618402                  5617
## 5 ACT   2604        8617             821838536                  6118
## 6 ACT   2605        7973             748963372                  6126

Note: The tidyverse commands such as group_by()/summarise_all() effectively perform the necessary conversions from wide to narrow/long formats, so there isn’t really the need to explicitly perform these conversions as would be the case with base R - using such packages s reshape2 with the commands melt() and cast().

Visualisation and ggplot

R also has functions to visualise data. One common library used for visualisations is ggplot2, which is included in the tidyverse. The code defines the data to be used, and then we use code to generate the graphs and assign values to the different aspects of the graphs. Code generated visualisation can be very efficient compared to GUI based platforms.

For example, to quickly visualise (without much styling!) the summed totals of the raw data per State

  • we take the code from earlier (#Totals by State),
  • use pivot_longer() to transform from wide to narrow/long format ready for ggplot, keeping the State as the identifier column,
  • use ggplot() adding geom_bar() for the bargraph and a facet_wrap() to show the different items in separate facets, each with independent y axis ranges (thanks to ’free_y’ scaling)

Note that here we need to assign the output of ggplot to a variable plot_states using <- and then display the contents of that variable, which is the plot.

plot2020_states_totals <- tax2020_raw %>%  
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  select(!Postcode) %>% 
  group_by(State) %>% 
  summarise_all(sum) %>% 
  pivot_longer( cols = -c(State), 
                names_to = "item", 
                values_to = "value", 
                values_drop_na = TRUE) %>% 
  ggplot(aes(x=State, y=value, fill=State)) +
  geom_bar(stat = "identity") + facet_wrap( vars(item), scales="free_y")

plot2020_states_totals

To quickly visualise the mean values per state

  • we take the code from earlier (#Means per State),
  • use pivot_longer() and ggplot() as before to produce the graph.
plot2020_state_means <- tax2020_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  mutate(TaxableIncome_dollarspr = TaxableIncome_dollars/Returns) %>% 
  mutate(PrivateHealth_percentpp = round(PrivateHealth_returns/Returns*100,0)) %>% 
  select(State, TaxableIncome_dollarspr, PrivateHealth_percentpp ) %>% 
  group_by(State) %>% 
  summarise_all(mean) %>% 
  pivot_longer( cols = -c(State), 
                names_to = "item", 
                values_to = "value", 
                values_drop_na = TRUE) %>% 
  ggplot(aes(x=State, y=value, fill=State)) +
  geom_bar(stat = "identity") + facet_wrap( vars(item), scales="free_y")

plot2020_state_means

Automation

When the source data changes, for example more data samples are collected or updated, using code to manipulate the data brings a massive advantage - automation. The same code can be re-executed on the new data for updated analysis and visualisations.

As an example, here is the code used to sum the three Taxation parameters for Australia for 2019-2020, re-executed for the 2018-2019 dataset, also published under Creative Commons Attribution 2.5 Australia.

Compare the results for 2018/19 and 2019/20.

tax2019_url <- 'https://data.gov.au/data/dataset/2805b28d-2c3b-47e2-87c3-50aacc6ea212/resource/3580e8f5-57ac-4848-9db0-48ba0c4a8a65/download/ts19individual06taxablestatusstateterritorypostcode.xlsx'

download.file(tax2019_url, 'tax2019.xlsx', mode = 'wb')

tax2019_raw <- read_excel('tax2019.xlsx', sheet = 'Table 6B', skip = 1, col_names = TRUE)[ ,c(1,2,3,5,152)]

names(tax2019_raw) <- c('State', 'Postcode', 'Returns', 'TaxableIncome_dollars', 'PrivateHealth_returns')

tax2019_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  select(!c('Postcode', 'State')) %>% 
  summarise_all(sum)
## # A tibble: 1 × 3
##    Returns TaxableIncome_dollars PrivateHealth_returns
##      <dbl>                 <dbl>                 <dbl>
## 1 14521870          913971167507               8175839
tax2020_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  select(!c('Postcode', 'State')) %>% 
  summarise_all(sum)
## # A tibble: 1 × 3
##    Returns TaxableIncome_dollars PrivateHealth_returns
##      <dbl>                 <dbl>                 <dbl>
## 1 14811707          952015405629               8300210

Also compare the plot of mean values for 2018/19.

plot2019_state_means <- tax2019_raw %>% 
  filter( State !="Unknown" & State!="Overseas" ) %>% 
  mutate(TaxableIncome_dollarspr = TaxableIncome_dollars/Returns) %>% 
  mutate(PrivateHealth_percentpp = round(PrivateHealth_returns/Returns*100,0)) %>% 
  select(State, TaxableIncome_dollarspr, PrivateHealth_percentpp ) %>% 
  group_by(State) %>% 
  summarise_all(mean) %>% 
  pivot_longer( cols = -c(State), 
                names_to = "item", 
                values_to = "value", 
                values_drop_na = TRUE) %>% 
  ggplot(aes(x=State, y=value, fill=State)) +
  geom_bar(stat = "identity") + facet_wrap( vars(item), scales="free_y")

plot2019_state_means

Key Learning

Key Learning #4 - Data frame structures are easily transformed in R, transform to whatever form is convenient for a particular purpose.

Key Learning #5 - Using code to perform analysis and generate graphs and visualisations saves a lot of time and finessing, particularly when tasks need to be repeated regularly and often.

Further Learning

Further Learning #2 - The above visualisations were generated quickly, without too much concern for formatting. Every aspect of the graphs above can be controlled to give beautiful and purposeful visualisations.

Previous

Part 2 - Basics

Next

Part 4 - Combining two datasets

Curtin logo

Version 1.8 - July 2024

COPYRIGHT © Curtin University 2024