library("tidyverse")
I want to create and visualize a simple data set for my Data Science courses (that I teach in California).
Data Source
- University of California
- Agriculture and Natural Resources
- Statewide Integrated Pest Management Program
- https://ipm.ucanr.edu/WEATHER/wxactstnames.html
Fixed-Width Files
Today I learned how to read fixed-width files
in the Tidyverse. From there, I simply need to give the columns easy-to-use names.
<- readr::read_fwf("LA_2022.txt") LA_df
Rows: 365 Columns: 9
── Column specification ────────────────────────────────────────────────────────
chr (4): X1, X4, X7, X9
dbl (4): X3, X5, X6, X8
time (1): X2
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(LA_df) <- c("date", "time", "precipitation",
"check1", "high", "low", "check2", "solar", "check3")
$city <- "Los Angeles" LA_df
<- readr::read_fwf("Merced_2022.txt") Merced_df
Rows: 365 Columns: 16
── Column specification ────────────────────────────────────────────────────────
chr (4): X1, X4, X7, X12
dbl (11): X3, X5, X6, X8, X9, X10, X11, X13, X14, X15, X16
time (1): X2
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(Merced_df) <- c("date", "time", "precipitation",
"check1", "high", "low", "check2")
$city <- "Merced" Merced_df
<- readr::read_fwf("SF_2022.txt") SF_df
Rows: 365 Columns: 7
── Column specification ────────────────────────────────────────────────────────
chr (3): X1, X4, X7
dbl (3): X3, X5, X6
time (1): X2
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(SF_df) <- c("date", "time", "precipitation",
"check1", "high", "low", "check2")
$city <- "San Francisco" SF_df
Merge
Some of the weather stations had collected more information than others. That is, if the weather station was newer, then it had more instruments.
For today’s quick exploration, I actually do want to perform a quick rbind
, and that requires that each of the 3 data frames have the same number of columns (and should be the same types of information too).
<- LA_df |>
LA_df select(city, date, time, high, low, precipitation)
<- Merced_df |>
Merced_df select(city, date, time, high, low, precipitation)
<- SF_df |>
SF_df select(city, date, time, high, low, precipitation)
<- rbind(LA_df, Merced_df, SF_df) CA_weather_data
# write_csv(CA_weather_data, "CA_weather_data.csv")
Data Viz
Now, boxplots are easy to make.
|>
CA_weather_data ggplot(aes(y = high)) +
geom_boxplot() +
labs(title = "California Weather, High Temperatures",
subtitle = "(all together)",
caption = "Source: UC\nAgriculture and Natural Resources\nStatewide Integrated Pest Management Program")
|>
CA_weather_data ggplot(aes(x = city, y = high, fill = city)) +
geom_boxplot() +
labs(title = "California Weather, High Temperatures",
subtitle = "(separate groups)",
caption = "Source: UC\nAgriculture and Natural Resources\nStatewide Integrated Pest Management Program")
Sample
For the creation of a classroom example, I want to randomly select 43 observations from the Merced data.
<- sort(sample(Merced_df$high, 43, replace = FALSE))
Merced_sample dput(Merced_sample)
c(53, 53, 55, 58, 58, 60, 60, 61, 62, 65, 67, 68, 70, 70, 71,
72, 74, 75, 77, 82, 82, 83, 84, 84, 87, 87, 88, 90, 91, 91, 92,
92, 93, 93, 94, 95, 96, 96, 98, 99, 101, 101, 105)