California Weather

Author

Derek Sollberger

Published

January 25, 2023

library("tidyverse")

I want to build and visualize a simple data set for the Data Science courses that I teach in California.

Data Source

  • University of California
  • Agriculture and Natural Resources
  • Statewide Integrated Pest Management Program
  • https://ipm.ucanr.edu/WEATHER/wxactstnames.html

Fixed-Width Files

Today I learned how to read fixed-width files in the Tidyverse with readr::read_fwf. From there, I simply need to give the columns easy-to-use names.

LA_df <- readr::read_fwf("LA_2022.txt")
Rows: 365 Columns: 9
── Column specification ────────────────────────────────────────────────────────

chr  (4): X1, X4, X7, X9
dbl  (4): X3, X5, X6, X8
time (1): X2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(LA_df) <- c("date", "time", "precipitation",
                     "check1", "high", "low", "check2", "solar", "check3")
LA_df$city <- "Los Angeles"
Merced_df <- readr::read_fwf("Merced_2022.txt")
Rows: 365 Columns: 16
── Column specification ────────────────────────────────────────────────────────

chr   (4): X1, X4, X7, X12
dbl  (11): X3, X5, X6, X8, X9, X10, X11, X13, X14, X15, X16
time  (1): X2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# The Merced file has 16 columns; only the first 7 are named and used here
colnames(Merced_df)[1:7] <- c("date", "time", "precipitation",
                              "check1", "high", "low", "check2")
Merced_df$city <- "Merced"
SF_df <- readr::read_fwf("SF_2022.txt")
Rows: 365 Columns: 7
── Column specification ────────────────────────────────────────────────────────

chr  (3): X1, X4, X7
dbl  (3): X3, X5, X6
time (1): X2

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(SF_df) <- c("date", "time", "precipitation",
                     "check1", "high", "low", "check2")
SF_df$city <- "San Francisco"
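
The calls above let readr guess where each column starts and ends. As a minimal sketch of the alternative, the column widths, names, and types can be declared up front. The two-line file below is made up for illustration (it is not the UC IPM layout), and the widths and column names are assumptions chosen to fit that toy file.

```r
library(readr)

# Write a tiny, made-up fixed-width file to a temporary location
toy_file <- tempfile(fileext = ".txt")
writeLines(c("20220101 2359  0.00   58  44",
             "20220102 2359  0.12   61  47"),
           toy_file)

# Declare the width and name of each column instead of letting readr guess
toy_positions <- fwf_widths(
  widths    = c(8, 5, 6, 5, 4),
  col_names = c("date", "time", "precipitation", "high", "low")
)

# "ccddd" = two character columns followed by three doubles
toy_df <- read_fwf(toy_file, col_positions = toy_positions, col_types = "ccddd")
toy_df
```

Naming the columns at read time would replace the separate colnames() step, at the cost of having to know the file layout in advance.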

Merge

Some of the weather stations collected more information than others. That is, the newer stations had more instruments.

For today’s quick exploration, I just want to perform a quick rbind, which requires that all 3 data frames have the same columns (the same names, and ideally the same types of information too).

LA_df <- LA_df |>
  select(city, date, time, high, low, precipitation)
Merced_df <- Merced_df |>
  select(city, date, time, high, low, precipitation)
SF_df <- SF_df |>
  select(city, date, time, high, low, precipitation)

CA_weather_data <- rbind(LA_df, Merced_df, SF_df)
# write_csv(CA_weather_data, "CA_weather_data.csv")
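
As a side note, here is a small sketch of why the columns were trimmed first, and of one alternative. The two data frames are made up for illustration. rbind() stops when the columns do not match, while dplyr::bind_rows() matches columns by name and fills the missing ones with NA.

```r
library(dplyr)

# Two made-up stations: the second one records an extra variable
station_a <- data.frame(city = "A", high = c(70, 72))
station_b <- data.frame(city = "B", high = c(80, 81), solar = c(500, 510))

# rbind() refuses data frames with different numbers of columns
rbind_result <- tryCatch(
  rbind(station_a, station_b),
  error = function(e) "error"
)

# bind_rows() matches columns by name and fills the gaps with NA
combined <- bind_rows(station_a, station_b)
combined
```

Selecting the shared columns first, as done above, keeps the merged data free of mostly-empty columns, so either approach can be reasonable depending on the goal.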

Data Viz

Now, boxplots are easy to make.

CA_weather_data |>
  ggplot(aes(y = high)) +
  geom_boxplot() +
  labs(title = "California Weather, High Temperatures",
       subtitle = "(all together)",
       caption = "Source: UC\nAgriculture and Natural Resources\nStatewide Integrated Pest Management Program")

CA_weather_data |>
  ggplot(aes(x = city, y = high, fill = city)) +
  geom_boxplot() +
  labs(title = "California Weather, High Temperatures",
       subtitle = "(separate groups)",
       caption = "Source: UC\nAgriculture and Natural Resources\nStatewide Integrated Pest Management Program")
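
For students who want the numbers behind the boxes, the quartiles can also be computed per group. This is only a sketch on made-up temperatures (the toy_weather values are not from the weather stations); it reuses the column names city and high from above.

```r
library(dplyr)

# Made-up high temperatures for two cities
toy_weather <- data.frame(
  city = rep(c("Merced", "San Francisco"), each = 5),
  high = c(55, 70, 85, 95, 100, 55, 60, 63, 66, 75)
)

# Quartiles per city, i.e. the edges and middle line of each box
toy_summary <- toy_weather |>
  group_by(city) |>
  summarise(
    q1     = quantile(high, 0.25, names = FALSE),
    median = median(high),
    q3     = quantile(high, 0.75, names = FALSE)
  )
toy_summary
```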

Sample

To create a classroom example, I want to randomly select 43 observations from the Merced data.

Merced_sample <- sort(sample(Merced_df$high, 43, replace = FALSE))
dput(Merced_sample)
c(53, 53, 55, 58, 58, 60, 60, 61, 62, 65, 67, 68, 70, 70, 71, 
72, 74, 75, 77, 82, 82, 83, 84, 84, 87, 87, 88, 90, 91, 91, 92, 
92, 93, 93, 94, 95, 96, 96, 98, 99, 101, 101, 105)
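
Because sample() draws at random, rerunning the code above will generally produce a different set of 43 values. A minimal sketch of making such a draw repeatable is to call set.seed() first. The vector toy_high and the seed 2023 are arbitrary choices made here for illustration, not part of the original analysis.

```r
# Made-up temperatures standing in for Merced_df$high
toy_high <- c(53, 58, 60, 65, 70, 75, 82, 87, 91, 96, 101, 105)

set.seed(2023)  # arbitrary seed, chosen only for illustration
first_draw <- sort(sample(toy_high, 5, replace = FALSE))

set.seed(2023)  # resetting the same seed reproduces the same draw
second_draw <- sort(sample(toy_high, 5, replace = FALSE))

identical(first_draw, second_draw)
```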