Conversation
Merge branch 'develop' of github.com:DOI-USGS/dataRetrieval into dv_service (conflicts: R/read_USGS_samples.R)
R/construct_api_requests.R
Outdated
```r
POST = TRUE
}

time_periods <- c("last_modified", "datetime", "time")
```
---
also "begin" and "end" on time-series-metadata
mikemahoney218-usgs left a comment
---
realized too late I was slowly working through the full code and should actually group comments into a review instead
```r
format_api_dates <- function(datetime){

  if(!any(isTRUE(is.na(datetime)) | isTRUE(is.null(datetime)))){
    if(length(datetime) == 1){
```
---
will this work for duration requests -- eg, time=P1D?
```r
explode_post <- function(ls){

  ls <- Filter(Negate(anyNA), ls)
```
---
love seeing Negate() in the wild
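For readers following along, `Filter(Negate(anyNA), ls)` drops every list element that contains an NA, which is how unsupplied arguments fall out of the POST body. A minimal standalone illustration (the parameter names and values here are just examples):

```r
# Drop any list elements containing NA before building a request body.
params <- list(
  monitoring_location_id = "USGS-01491000",    # example values only
  parameter_code = c("00060", "00065"),
  statistic_id = NA_character_                 # unsupplied -> dropped
)
cleaned <- Filter(Negate(anyNA), params)
names(cleaned)
#> [1] "monitoring_location_id" "parameter_code"
```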
R/construct_api_requests.R
Outdated
```r
#' Check OGC requests
#'
#' @param endpoint Character, can be "daily", "timeseries-metadata"
```
---
```diff
-#' @param endpoint Character, can be "daily", "timeseries-metadata"
+#' @param endpoint Character, can be "daily", "time-series-metadata", "monitoring-locations"
```
Might be worth genericizing to "can be any existing collection"?
R/read_USGS_daily.R
Outdated
```r
service <- "daily"

dv_req <- construct_api_requests(service = service,
```
---
for what it's worth, I've used a much less readable but self-maintaining approach with do.call in the past
https://github.com/Permian-Global-Research/rsi/blob/main/R/get_stac_data.R#L702-L705
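The linked approach, roughly: collect the function's own arguments from its frame and forward them with `do.call()`, so a new API parameter only needs to be added to the signature. A hypothetical sketch under stated assumptions (`build_query` and `read_sketch` are stand-ins, not dataRetrieval functions):

```r
# Stand-in for construct_api_requests(): just captures its arguments.
build_query <- function(...) list(...)

read_sketch <- function(service = "daily",
                        parameter_code = NA_character_,
                        limit = NA_integer_) {
  # Grab every argument's current value from the call frame...
  args <- as.list(environment())
  # ...and forward them all without naming each one by hand.
  do.call(build_query, args)
}

query <- read_sketch(parameter_code = "00060")
query$service
#> [1] "daily"
```

The trade-off the commenter mentions: it is self-maintaining, but a reader can no longer see at the call site which arguments are passed through.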
R/walk_pages.R
Outdated
```r
make_request <- httr2::req_url(req = req,
                               url = json_content$links[, "href", drop = TRUE][next_page])

Tailcall(
```
---
I just saw that Tailcall is an R 4.4 feature -- might be worth moving to a loop instead to keep backwards compatibility?
---
Dang it, I only learned it from you but it's a sweet feature!
---
"Tailcall and Exec are experimental and may be changed or dropped in future released versions of R."
---
yeah, I'm personally betting that note will be there in 20 years 😆
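For reference, the same pagination can be written as a plain `repeat` loop with no `Tailcall()` and no stack growth, which runs on any R version. A toy sketch with a fake pager standing in for the real HTTP calls (none of these names are from the package):

```r
# Loop-based pagination: accumulate each page until there is no next page.
walk_pages_loop <- function(fetch_page) {
  contents <- list()
  page <- 1
  repeat {
    result <- fetch_page(page)
    contents[[page]] <- result$data
    if (is.null(result$next_page)) break
    page <- result$next_page
  }
  contents
}

# Fake pager: three pages of data, then done.
fake_fetch <- function(page) {
  list(data = page * 10, next_page = if (page < 3) page + 1 else NULL)
}

length(walk_pages_loop(fake_fetch))
#> [1] 3
```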
ehinman left a comment
---
Still doing some testing and such, but here are a few thoughts/discoveries.
R/walk_pages.R
Outdated
```r
walk_pages <- function(req) {

  walk_pages_recursive(
```
---
Why does walk_pages_recursive() need to be in a wrapper function?
---
Pretty sure this is derived from the version I sent Laura a while back -- and I think I set it up this way so that you can initialize contents and page outside of the recursive context, so they don't get reset and you don't need global assignment.
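The pattern being described, as a toy sketch rather than the package's actual code: the wrapper seeds `page` and `contents` exactly once, so the recursion threads its own state and never needs `<<-` or globals (the stop condition here is made up):

```r
# Outer wrapper initializes the accumulator exactly once per call.
walk_sketch <- function(fetch) {
  walk_sketch_recursive(fetch, page = 1, contents = list())
}

walk_sketch_recursive <- function(fetch, page, contents) {
  contents[[page]] <- fetch(page)
  if (page >= 3) return(contents)   # toy stop condition
  walk_sketch_recursive(fetch, page + 1, contents)
}

pages <- walk_sketch(function(p) p^2)
length(pages)
#> [1] 3
```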
```md
# API Tokens
```
---
I would suggest placing this at the top somewhere. I could see someone running through this vignette, hitting an error with the first example, and not knowing that they need to scroll down to find the solution.
```r
#' Get USGS Daily Data
#'
#' Description `r get_description("daily")`
```
---
I find the first sentence of the description confusing: "Daily data provide one data value to represent water conditions for the day." I'd add "per parameter" or similar. I know this is at the API level. @mikemahoney218-usgs thoughts?
---
We steal this from WDFN:
https://waterdata.usgs.gov/data-collection-categories//#daily-data
They also provide multiple different statistics -- this paragraph makes more sense if you change the opening to "Daily data time series provide"
If we've got suggestions, I'm happy to ping the UI team -- though I think this page is a Leah project and I don't know how much attention it'll get before she's back!
```r
#' @param properties The properties that should be included for each feature.
#' The parameter value is a comma-separated list of property names. Available options are
#' `r schema <- check_OGC_requests(endpoint = "daily", type = "schema"); paste(names(schema$properties), collapse = ", ")`
#' @param bbox Only features that have a geometry that intersects the bounding
```
---
Can you help me out and enumerate which northing/easting/southing/westing goes where in the bbox order? I always forget. Or at least link out to the answer?
---
The format is a string consisting of:

- western-most longitude
- southern-most latitude
- eastern-most longitude
- northern-most latitude
---
`xmin,ymin,xmax,ymax`
I know this is specified somewhere... but I don't have the standard number to hand.
Something that might be worth flagging is that sf::st_bbox will automatically be in the right sequence (because it uses the same standard).
---
Still rooting for a template here in the description.
---
I did add an example in the monitoring-location function (since hopefully more people will use bboxes in that service compared to the daily service). But I'll add some manual text to the param descriptions.
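To summarize the thread in code: the bbox order is west, south, east, north (xmin, ymin, xmax, ymax), the same order `sf::st_bbox()` reports. A base-R sketch with made-up coordinates:

```r
# OGC-style bbox: xmin (west), ymin (south), xmax (east), ymax (north).
bbox <- c(xmin = -94.165, ymin = 34.985, xmax = -92.321, ymax = 36.136)

# Sanity checks: west < east and south < north.
bbox[["xmin"]] < bbox[["xmax"]]
#> [1] TRUE
bbox[["ymin"]] < bbox[["ymax"]]
#> [1] TRUE
```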
```r
#' parameter_code = c("00060", "00065"),
#' datetime = c(start_date, end_date))
#'
construct_api_requests <- function(service,
```
---
What am I doing wrong here? I tried both bboxes just in case I had it wrong. Errored on both.
```r
bbox_stuff = c(-94.16516107720525, 34.985431123844755, -92.32139202893917, 36.13639585246413)
bbox_stuff = c("-94.16516107720525", "34.985431123844755", "-92.32139202893917", "36.13639585246413")
bbox_test <- construct_api_requests(service = "daily", bbox = bbox_stuff, parameter_code = '00060')
```

Returning:

```
Error in if (all(is.na(full_list)) & is.na(bbox)) { :
  the condition has length > 1
```
Seems to be around line 66 of construct_api_requests
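The error itself is generic R behavior rather than anything dataRetrieval-specific: `is.na()` on a length-4 bbox yields four logicals, and `if()` refuses a condition of length greater than one. A minimal reproduction, plus the usual `all()` fix:

```r
bbox <- c(-94.165, 34.985, -92.321, 36.136)

# is.na() is vectorized, so this is length 4, not a single TRUE/FALSE:
length(is.na(bbox))
#> [1] 4

# Collapsing with all() gives if() the single logical it needs:
all(is.na(bbox))
#> [1] FALSE
```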
---
Why and how did this work!? I accidentally put in monitoring_location_identifier and it told me it was an unused argument, but it was okay with monitoring_location.
```r
site_test <- read_USGS_daily(monitoring_location = "USGS-06137570", parameter_code = '00060')
```
---
monitoring_location_id is the full parameter -- you partially matched it with the second example
We use ID instead of identifier throughout because of the geojson standard suggesting "id" as a top-level parameter. It's a bit of an infelicity though, considering we use very few abbreviations otherwise
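What happened is R's partial argument matching: any unambiguous prefix of a formal name matches it, while a string that is not a prefix (like `monitoring_location_identifier`) is an unused argument. A standalone demonstration with a toy function:

```r
f <- function(monitoring_location_id = NA_character_) monitoring_location_id

# "monitoring_location" is a prefix of "monitoring_location_id", so it matches:
f(monitoring_location = "USGS-06137570")
#> [1] "USGS-06137570"

# "monitoring_location_identifier" is NOT a prefix, so it errors
# with an "unused argument" message:
tryCatch(f(monitoring_location_identifier = "x"),
         error = function(e) "unused argument error")
```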
---
Whoops, sorry about that; I pushed up a change yesterday and didn't test the bbox for side effects.
```r
#' box are selected. The bounding box is provided as four or six numbers, depending
#' on whether the coordinate reference system includes a vertical axis (height or
#' depth).
#' @param properties The properties that should be included for each feature. The
```
---
Are properties the same as columns in this case, and features represent each....row? I find the term "property" understandable in terms of a JSON response, but maybe a little less familiar as another term for "column". Could definitely get used to it though!
---
Yes, exactly that! GeoJSON stashes attributes under the "properties" key, so if you're making direct API calls it makes sense that you'd filter that array down using the properties argument.
---
I toyed initially with the idea of calling the argument "columns", but don't want to sway too far from the actual service. How about we be more explicit in the descriptions, but still call the actual argument "properties"?
```r
#' @param properties A vector of requested columns to be returned from the query.
#' Available options are:
```
---
I like this solution. I appreciate sticking to the API service!
```r
#' time = c("2023-01-01", "2024-01-01"))
#'
#' }
read_USGS_daily <- function(monitoring_location_id = NA_character_,
```
---
Another place I'm getting an error:

```r
site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", parameter_code = '00060', statistic_id = "00002")
```

```
Function in development, use at your own risk.
GET: https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=json&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&statistic_id=00002
Error in order(return_list$time, return_list$monitoring_location_id) :
  argument 1 is not a vector
In addition: Warning messages:
1: In min(bb[, 1L], na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In min(bb[, 2L], na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
3: In max(bb[, 3L], na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
4: In max(bb[, 4L], na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
5: Unknown or uninitialised column: `time`.
6: Unknown or uninitialised column: `monitoring_location_id`.
```

It does work if you wrap the statistic_id into a c():

```r
site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", parameter_code = '00060', statistic_id = c("00003"))
```

```
Function in development, use at your own risk.
GET: https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=json&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&statistic_id=00003
```

This is also the same error you'd get if you try a non-existent statistic id. I just tried c("3") and it gave the same error. I understand why type checking isn't here yet, but something to consider.
---
Hmmm... also here. I first just put in a YYYY-MM-DD and got an error, consulted the help, then tried a single datetime, and that failed, then tried a bounded interval:

```r
site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", parameter_code = '00060', last_modified = "2018-02-12T00:00:00Z/2024-03-18T12:31:12Z")
```

```
Function in development, use at your own risk.
GET: https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=json&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&last_modified=2018-02-12T00%3A00%3A00Z%2F2024-03-18T12%3A31%3A12Z
Error in order(return_list$time, return_list$monitoring_location_id) :
  argument 1 is not a vector
In addition: Warning messages:
1: In min(bb[, 1L], na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In min(bb[, 2L], na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
3: In max(bb[, 3L], na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
4: In max(bb[, 4L], na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
5: Unknown or uninitialised column: `time`.
6: Unknown or uninitialised column: `monitoring_location_id`.
```
---
For the first example, you asked for a stat code that the site doesn't offer - so a great example of a valid request with no returned data. I'll work up a better response. It's something to think about.
- The service itself gives some json:

```json
{
  "type": "FeatureCollection",
  "features": [],
  "numberReturned": 0,
  "links": [
    {
      "type": "application/geo+json",
      "rel": "self",
      "title": "This document as GeoJSON",
      "href": "https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=json&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&statistic_id=00002"
    },
    {
      "rel": "alternate",
      "type": "application/ld+json",
      "title": "This document as RDF (JSON-LD)",
      "href": "https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=jsonld&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&statistic_id=00002"
    },
    {
      "type": "text/html",
      "rel": "alternate",
      "title": "This document as HTML",
      "href": "https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=html&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&statistic_id=00002"
    },
    {
      "type": "application/json",
      "title": "Daily values",
      "rel": "collection",
      "href": "https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily"
    }
  ],
  "timeStamp": "2025-05-23T14:18:51.680649Z"
}
```

Which we can pretty easily check for the `"numberReturned": 0`. HOWEVER, historically dataRetrieval users really like their empty requests to come back with an empty data frame so they can have it in a big ol' loop and just keep binding rows together.
Side note: I'm about to convert all the readNWISdv and readNWISsite functions in the testthat tests to these functions - so there will probably be other error handling stuff like this fixed today.
---
Similarly, that 2nd example had no data that was last modified between 2018 and 2024. If you changed it to:

```r
site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570",
                             parameter_code = '00060',
                             last_modified = "2024-03-18T12:31:12Z/2025-05-18T12:31:12Z")
```

then you'd get some results. So, number 1 priority this morning is the no data result. Give me a little bit to think about that....
---
These all work great now, with the first one returning a df of 0 rows. Thanks @ldecicco-USGS!!
```r
#' multi_site <- read_USGS_monitoring_location(state_name = "Wisconsin")
#'
#' }
read_USGS_monitoring_location <- function(monitoring_location_id = NA_character_,
```
---
Ran into an issue with this call (is it the query rate limiter or...?):

```r
sites = whatNWISsites(statecd = "NY")
sites = paste0("USGS-", sites$site_no)
blah <- read_USGS_monitoring_location(monitoring_location_id = sites, skipGeometry = TRUE)
```

```
Function in development, use at your own risk.
POST: https://api.waterdata.usgs.gov/ogcapi/v0/collections/monitoring-locations/items?f=json&lang=en-US&skipGeometry=TRUE&limit=10000
Error in `httr2::req_perform()` at dataRetrieval/R/walk_pages.R:53:3:
! HTTP 403 Forbidden.
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/httr2_http_403>
Error in `httr2::req_perform()` at dataRetrieval/R/walk_pages.R:53:3:
! HTTP 403 Forbidden.
---
Backtrace:
    ▆
 1. └─dataRetrieval::read_USGS_monitoring_location(...)
 2.   └─dataRetrieval:::walk_pages(site_req) at dataRetrieval/R/read_USGS_monitoring_location.R:132:3
 3.     └─dataRetrieval:::walk_pages_recursive(req = req, page = 1, contents = list()) at dataRetrieval/R/walk_pages.R:34:3
 4.       └─httr2::req_perform(req) at dataRetrieval/R/walk_pages.R:53:3
Run rlang::last_trace(drop = FALSE) to see 3 hidden frames.
```

Just interesting that if I use sites[1:100] as an input, it works fine, even with a POST. sites is 88,922 sites long, so maybe I'm hitting the limit of what the database accepts.
---
Can I ask why you're querying for a specific 88,922 sites? I have a hard time imagining the use case for that one 😅
---
LOL. I didn't realize you could query by state_name. Shame on me. Probably unlikely, but I probably wouldn't be the first, either. I was mimicking a workflow I've seen to grab all the site numbers within a state to then pull peaks data, and I wanted to see how it would handle a long set of sites. Yep, 89,000 is excessive.
---
But another good reason for req_error() handling!
---
Agree -- but our suggestion to this user is going to be "consider splitting your request into multiple pieces"
---
Message looks good upon retrying it. Thanks!
```r
read_USGS_monitoring_location <- function(monitoring_location_id = NA_character_,
                                          agency_code = NA_character_,
                                          agency_name = NA_character_,
                                          monitoring_location_number = NA_character_,
```
---
Interesting that there's id and number. My guess is that hardcore legacy users will want to stick with number.
---
Data furnished from other agencies can have the same site number as USGS sites. Monitoring location IDs are guaranteed to be unique across the entire network.
Co-authored-by: Elise Hinman <[email protected]>
I'm not sure about having these "shadow" functions. The pros are that they keep some "normalcy" with what was, but the cons are that they keep people from fully adopting the new functions. Feel free to push back, I just would rather limit the number of available options.

Yup, I think after some back-and-forth scattered above I've talked myself out of it too.

I consider it as being the new function, just with a name that evokes the old function and may be easier for users to remember. I'll admit that my desire to add this is based on a frustration I sometimes have with [...]. I'm thinking that long-time [...]
OK folks! I know it's Friday afternoon before a 3 day weekend, so before I forget: I added a little bit of error handling here. I switched to [...]. We also return an empty data frame now if a user makes a valid request for no data (like, it's a site with data, but not during the requested time period). So give it another good test next week.
Unless I hear objections, I'll plan to merge this tomorrow morning (5/28). That will make it easier to get a few more folks testing the new functions, and we'll be able to have small digestible PRs going forward (updates to the vignette, tests for some of the internal functions, etc...)
```r
#' box are selected. The bounding box is provided as four or six numbers, depending
#' on whether the coordinate reference system includes a vertical axis (height or
#' depth). Coordinates are assumed to be in crs 4326.
#' @param limit The optional limit parameter limits the number of items that are
```
---
I'm confused about what this means. I tried this:

```r
bbox <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5), parameter_code = c("00060", "00065"), limit = 100)
```

And it was super fast, and returned exactly 2000 rows, which seemed weird, but I'll roll with it. I then tried this (no limit):

```r
bbox <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5), parameter_code = c("00060", "00065"))
```

And it cannot seem to complete. I tried running it twice; it eventually errored out (error screenshot omitted).
---
I think we'll want to train users to use the read_USGS_ts_meta function to find the info they want first.

```r
bbox_vals = c(-83, 36.5, -82, 37.5)
multi_site <- read_USGS_ts_meta(bbox = bbox_vals,
                                parameter_code = c("00060", "00065"))
library(dplyr)
ids_I_want <- multi_site |>
  filter(end >= as.Date("2025-05-01"),
         computation_identifier != "Instantaneous")
nrow(ids_I_want)
#> 168
```

Note, that is still a TON of data (1 of those sites claims it goes back to 1791?), it's going to exceed the amount we can ask for in 1 go. But say you do rejigger the ids and find exactly the ones you are interested in, then you could do this and avoid using the bounding box on the daily service:

```r
multi_site_dv <- read_USGS_daily(time_series_id = ids_I_want$time_series_metadata_id[1:10])
```

---
Good point! I like this solution.
---
I guess I'm still confused on what limit means, though. My limit was 100 and I got 2000 rows returned. What is the math on that?
---
What I'm guessing... it seems like we can get 20 pages at most. At least in my little bit of testing, I'm seeing 20 pages coming back without a warning that you are missing data. I'll try to confirm, but here's what I see:

```r
bbox_lim_100 <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5),
                                parameter_code = c("00060", "00065"),
                                limit = 100)
nrow(bbox_lim_100)
#> 2000

bbox_lim_500 <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5),
                                parameter_code = c("00060", "00065"),
                                limit = 500)
nrow(bbox_lim_500)
#> 10000

bbox_lim_1000 <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5),
                                 parameter_code = c("00060", "00065"),
                                 limit = 1000)
nrow(bbox_lim_1000)
#> 20000

bbox_lim_2000 <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5),
                                 parameter_code = c("00060", "00065"),
                                 limit = 2000)
nrow(bbox_lim_2000)
#> 40000
```
---
HA! That's on me! I didn't realize there was a limit on the req_perform_iterative function:

```r
req_perform_iterative(
  req,
  next_req,
  path = NULL,
  max_reqs = 20,   # <- WHOOOPS!
  on_error = c("stop", "return"),
  progress = TRUE
)
```
```r
return_list <- rejigger_cols(return_list, properties, service)

return(return_list)
```
---
Observation: I tried making a big-ish call (again, maybe I'm being unreasonable here, but that's not THAT big of a bbox):

```r
bbox <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5))
```

I loved the progress bar, but it kept upping its time every iteration (screenshots omitted), to the point where I decided maybe I was asking too much of it and I stopped it. It returned 30,000 rows of data that it had grabbed before I stopped it. That was cool!

However, when the function errors while it is iterating, it returns nothing. Is it possible to have it return the rows it does grab, and some identifier that would allow a user to keep going if they wanted to grab the rest?
---
^^ This is literally one of the biggest problems we try to solve with targets in the fetch portion...if dataRetrieval could do some of that for us, it would be really neat!
---
It doesn't seem like we can get an accurate progress bar because it's SUPER expensive for the database to pre-calculate how many pages it's going to return. I was going to mess with the display information eventually, but since it will never be terribly accurate, I'm having a hard time thinking of a reason to keep in a progress bar. I haven't messed with enough of the httr2 options yet to know what to fiddle with to make it a useful set of info to display.
---
What I've been assuming those "ETA"s are: they seem to me to be how long the previous call took, used as an ETA on the current iteration (not the overall time remaining).
---
Ahhh. Hmm....are pages organized in a consistent way? (a question for @mikemahoney218-usgs) Like say someone makes a big call that eventually times out after X pages have been completed...is there any way they can use the information from the completed pages to simply "continue" on downloading in a new call?
Co-authored-by: Elise Hinman <[email protected]>
R/construct_api_requests.R
Outdated
```r
all_properties <- names(schema$properties)

match.arg(properties, choices = c(all_properties, NA_character_),
          several.ok = TRUE)
```
---
This is minor, so feel free to disregard unless it becomes a problem. But, nonetheless something I noticed while crashing around in here.

```r
site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", parameter_code = '00060', properties = c("dog"))
```

I love the message that tells you what you can place in properties. However, one of the options is displayed as "NA". So I decided to try "NA", like this:

```r
site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", parameter_code = '00060', properties = c("NA"))
```

...and it gave me a separate error:

```
Error in match.arg(properties, choices = c(all_properties, NA_character_), :
  'arg' must be NULL or a character vector
```

At this point, someone should know to replace "NA" with NA_character_... and really just shouldn't mess with it at all if they want everything. I guess my point is that maybe "NA" shouldn't be in the list of allowable values, since it needs to be that specific NA_character_.
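One possible shape for that fix, sketched here rather than taken from the package: leave NA out of the choices entirely, and only call match.arg() when the user actually supplied properties (`check_properties` is a hypothetical helper):

```r
check_properties <- function(properties, all_properties) {
  # An all-NA input means "give me everything": skip validation entirely.
  if (all(is.na(properties))) return(invisible(NULL))
  # Otherwise validate against the real property names only; no NA in choices.
  match.arg(properties, choices = all_properties, several.ok = TRUE)
}

props <- c("time", "value", "monitoring_location_id")
check_properties(NA_character_, props)        # skipped, returns NULL invisibly
check_properties(c("time", "value"), props)   # passes, returns the matches
```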
```r
#' "USGS-01645000"),
#' parameter_code = c("00060", "00010"),
#' limit = 500,
#' time = c("2023-01-01", "2024-01-01"))
```
---
Ah, so here's an example (sorry, I didn't look at this first). I tried this example with and without the limit, and it returned the same number of rows, which is what I would've expected with the bbox query. Strange behavior that the example above works with pages of 100 but not pages of 10,000?
First pass at:

- read_NWIS_daily
- read_NWIS_monitoring_location
- read_NWIS_ts_meta
To use the inline R functionality in the roxygen2 (to take advantage of dynamically pulling in the param descriptions), I had to switch the roxygen from classic to markdown...which is why there are 75 changed files.