library("tidyverse")

I want to create and visualize a simple data set for the Data Science courses that I teach in California.

Data Source

University of California
Agriculture and Natural Resources
Statewide Integrated Pest Management Program
https://ipm.ucanr.edu/WEATHER/wxactstnames.html

Fixed-Width Files

Today I learned how to read fixed-width files in the Tidyverse using readr::read_fwf. After reading each file, I simply need to give the columns easy-to-use names.

LA_df <- readr::read_fwf("LA_2022.txt")

Rows: 365 Columns: 9
── Column specification ────────────────────────────────────────────────────────

chr  (4): X1, X4, X7, X9
dbl  (4): X3, X5, X6, X8
time (1): X2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

colnames(LA_df) <- c("date", "time", "precipitation",
                     "check1", "high", "low", "check2", "solar", "check3")
LA_df$city <- "Los Angeles"
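
The renaming step above could also be folded into the read itself. The following is only a sketch under stated assumptions: the two sample lines, the field widths, and the name toy_df are invented for illustration, and the real station files may be laid out differently. It uses readr::fwf_widths to pair assumed widths with column names, and a compact col_types string so that nothing is guessed.

```r
library(readr)

# Invented two-row sample; the real station files may use a different layout.
sample_lines <- c(
  "20220101 2359  0.00   58 41",
  "20220102 2359  0.12   60 44"
)
sample_path <- tempfile(fileext = ".txt")
writeLines(sample_lines, sample_path)

# Assumed widths (in characters) for each field, paired with readable names.
layout <- fwf_widths(
  widths    = c(8, 5, 6, 5, 3),
  col_names = c("date", "time", "precipitation", "high", "low")
)

toy_df <- read_fwf(
  sample_path,
  col_positions = layout,
  col_types     = "ccddd"  # c = character, d = double
)
toy_df$city <- "Los Angeles"
```

Naming the columns at read time avoids the intermediate X1, X2, ... names, at the cost of having to know the widths in advance.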

Merced_df <- readr::read_fwf("Merced_2022.txt")

Rows: 365 Columns: 16
── Column specification ────────────────────────────────────────────────────────

chr   (4): X1, X4, X7, X12
dbl  (11): X3, X5, X6, X8, X9, X10, X11, X13, X14, X15, X16
time  (1): X2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Merced has 16 columns; name only the first 7 (the rest are not used here)
colnames(Merced_df)[1:7] <- c("date", "time", "precipitation",
                              "check1", "high", "low", "check2")
Merced_df$city <- "Merced"

SF_df <- readr::read_fwf("SF_2022.txt")

Rows: 365 Columns: 7
── Column specification ────────────────────────────────────────────────────────

chr  (3): X1, X4, X7
dbl  (3): X3, X5, X6
time (1): X2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

colnames(SF_df) <- c("date", "time", "precipitation",
                     "check1", "high", "low", "check2")
SF_df$city <- "San Francisco"

Merge

Some of the weather stations collected more information than others. That is, if the weather station was newer, then it had more instruments.

For today’s quick exploration, a simple rbind is enough. It requires that all 3 data frames have the same columns (the same names, holding the same kinds of information), so I keep only the columns that they share.

LA_df <- LA_df |>
  select(city, date, time, high, low, precipitation)
Merced_df <- Merced_df |>
  select(city, date, time, high, low, precipitation)
SF_df <- SF_df |>
  select(city, date, time, high, low, precipitation)

CA_weather_data <- rbind(LA_df, Merced_df, SF_df)

# write_csv(CA_weather_data, "CA_weather_data.csv")
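
As an aside, dplyr::bind_rows is more forgiving than rbind when the columns differ: it matches columns by name and fills the gaps with NA. The sketch below uses made-up toy data frames (la_toy, merced_toy, and their values are hypothetical) rather than the station data.

```r
library(dplyr)

# Toy stand-ins for two station data frames; all values are made up.
la_toy <- data.frame(
  city  = "Los Angeles",
  high  = c(68, 70),
  low   = c(50, 52),
  solar = c(300, 310)   # extra column that the other station lacks
)
merced_toy <- data.frame(
  city = "Merced",
  high = c(55, 57),
  low  = c(38, 40)
)

# bind_rows() matches columns by name; missing columns are filled with NA.
combined <- bind_rows(la_toy, merced_toy)
combined
```

For this post, selecting the shared columns first keeps the merged data set small and free of mostly-empty columns, so the rbind approach is still a reasonable choice.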

Data Viz

Now, boxplots are easy to make.

CA_weather_data |>
  ggplot(aes(y = high)) +
  geom_boxplot() +
  labs(title = "California Weather, High Temperatures",
       subtitle = "(all together)",
       caption = "Source: UC\nAgriculture and Natural Resources\nStatewide Integrated Pest Management Program")

CA_weather_data |>
  ggplot(aes(x = city, y = high, fill = city)) +
  geom_boxplot() +
  labs(title = "California Weather, High Temperatures",
       subtitle = "(separate groups)",
       caption = "Source: UC\nAgriculture and Natural Resources\nStatewide Integrated Pest Management Program")
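
The numbers behind each box can also be tabulated. This is a minimal sketch on made-up data (the toy data frame and its values are hypothetical); the same pipeline should work on CA_weather_data, assuming the high column has no missing values. Note that in geom_boxplot the whiskers reach at most 1.5 times the interquartile range from the box, so they match the minimum and maximum only when there are no outliers.

```r
library(dplyr)

# Made-up temperatures for two hypothetical cities.
toy <- data.frame(
  city = rep(c("A", "B"), each = 5),
  high = c(50, 60, 70, 80, 90,
           55, 56, 57, 58, 59)
)

# Five-number summary per city: the quartiles are what the box itself shows.
summary_by_city <- toy |>
  group_by(city) |>
  summarise(
    min    = min(high),
    q1     = quantile(high, 0.25, names = FALSE),
    median = median(high),
    q3     = quantile(high, 0.75, names = FALSE),
    max    = max(high),
    .groups = "drop"
  )
summary_by_city
```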

Sample

To create a classroom example, I want to randomly select 43 of the high temperatures from the Merced data.

Merced_sample <- sort(sample(Merced_df$high, 43, replace = FALSE))
dput(Merced_sample)

c(53, 53, 55, 58, 58, 60, 60, 61, 62, 65, 67, 68, 70, 70, 71, 
72, 74, 75, 77, 82, 82, 83, 84, 84, 87, 87, 88, 90, 91, 91, 92, 
92, 93, 93, 94, 95, 96, 96, 98, 99, 101, 101, 105)
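
Because sample draws at random, rerunning the code above gives a different set of 43 values each time. If a repeatable draw is wanted, one option is to call set.seed first. In this sketch the seed value 2022 is arbitrary and the vector highs is a stand-in for Merced_df$high, not the real data.

```r
# Stand-in for Merced_df$high; the real column comes from the station file.
highs <- 40:110

set.seed(2022)  # arbitrary seed, chosen only for illustration
sample_a <- sort(sample(highs, 43, replace = FALSE))

set.seed(2022)  # resetting the seed reproduces the same draw
sample_b <- sort(sample(highs, 43, replace = FALSE))

identical(sample_a, sample_b)
```

The dput output plays a similar role here: pasting the printed vector into class materials fixes the sample regardless of the seed.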