
New API services! #773

Merged
ldecicco-USGS merged 56 commits into DOI-USGS:develop from ldecicco-USGS:dv_service
May 28, 2025

Conversation

@ldecicco-USGS
Collaborator

@ldecicco-USGS ldecicco-USGS commented May 16, 2025

First pass at:

read_NWIS_daily
read_NWIS_monitoring_location
read_NWIS_ts_meta

To use the inline R functionality in roxygen2 (to take advantage of dynamically pulling in the parameter descriptions), I had to switch the roxygen syntax from classic to markdown...which is why there are 75 changed files.

POST = TRUE
}

time_periods <- c("last_modified", "datetime", "time")

also "begin" and "end" on time-series-metadata


@mikemahoney218-usgs mikemahoney218-usgs left a comment


realized too late I was slowly working through the full code and should actually group comments into a review instead

format_api_dates <- function(datetime){

if(!any(isTRUE(is.na(datetime)) | isTRUE(is.null(datetime)))){
if(length(datetime) == 1){

will this work for duration requests -- eg, time=P1D?


explode_post <- function(ls){

ls <- Filter(Negate(anyNA), ls)

love seeing Negate() in the wild
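For anyone new to the pattern, a minimal self-contained sketch of what `Filter(Negate(anyNA), ls)` accomplishes — the parameter names below are illustrative, not the package's actual query object:

```r
# Illustrative query list: statistic_id was never set by the user.
query <- list(
  monitoring_location_id = "USGS-06137570",
  parameter_code = c("00060", "00065"),
  statistic_id = NA_character_
)

# Drop every element that contains an NA, keeping only set parameters.
query <- Filter(Negate(anyNA), query)

names(query)
#> [1] "monitoring_location_id" "parameter_code"
```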


#' Check OGC requests
#'
#' @param endpoint Character, can be "daily", "timeseries-metadata"

Suggested change
#' @param endpoint Character, can be "daily", "timeseries-metadata"
#' @param endpoint Character, can be "daily", "time-series-metadata", "monitoring-locations"

Might be worth genericizing to "can be any existing collection"?


service <- "daily"

dv_req <- construct_api_requests(service = service,

for what it's worth, I've used a much less readable but self-maintaining approach with do.call in the past
https://github.com/Permian-Global-Research/rsi/blob/main/R/get_stac_data.R#L702-L705

Collaborator Author

Slick

R/walk_pages.R Outdated
make_request <- httr2::req_url(req = req,
url = json_content$links[, "href", drop = TRUE][next_page])

Tailcall(

I just saw that Tailcall is an R 4.4 feature -- might be worth moving to a loop instead to keep backwards compatibility?
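For reference, the recursion can be flattened into a loop along these lines — a sketch with a mocked `fetch_page()` standing in for the real httr2 request logic; none of these names are from the package:

```r
# fetch_page() mocks performing one request: it returns the page's data
# and the URL of the next page (NULL once pagination is exhausted).
fetch_page <- function(url) {
  page_num <- as.integer(sub(".*page=", "", url))
  list(
    data = data.frame(page = page_num),
    next_url = if (page_num < 3) sprintf("https://example.invalid/items?page=%d", page_num + 1)
  )
}

# Loop-based walk_pages: no Tailcall(), so it runs on older R versions.
walk_pages_loop <- function(url) {
  contents <- list()
  while (!is.null(url)) {
    resp <- fetch_page(url)
    contents[[length(contents) + 1]] <- resp$data
    url <- resp$next_url
  }
  do.call(rbind, contents)
}

res <- walk_pages_loop("https://example.invalid/items?page=1")
nrow(res)
#> [1] 3
```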

Collaborator Author

Dang it, I only learned it from you but it's a sweet feature!

Collaborator Author

"Tailcall and Exec are experimental and may be changed or dropped in future released versions of R."


yeah, I'm personally betting that note will be there in 20 years 😆

Collaborator

@ehinman ehinman left a comment

Still doing some testing and such, but here are a few thoughts/discoveries.

R/walk_pages.R Outdated

walk_pages <- function(req) {

walk_pages_recursive(
Collaborator

Why does walk_pages_recursive() need to be in a wrapper function?


Pretty sure this is derived from the version I sent Laura a while back -- and I think I set it up this way so that you can initialize contents and page outside of the recursive context, so they don't get reset and you don't need global assignment. I think
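The wrapper-initializes-the-accumulator pattern in miniature (a hypothetical example, not the package's code):

```r
# The outer function seeds the accumulator once; the recursive helper
# threads it through its arguments, so no global assignment is needed.
walk_down <- function(n) walk_down_recursive(n, contents = integer())

walk_down_recursive <- function(n, contents) {
  if (n == 0) return(contents)
  walk_down_recursive(n - 1, c(contents, n))
}

walk_down(3)
#> [1] 3 2 1
```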

# API Tokens
Collaborator

I would suggest placing this at the top somewhere. I could see someone running through this vignette, hitting an error with the first example, and not knowing that they need to scroll down to find the solution.

@@ -0,0 +1,102 @@
#' Get USGS Daily Data
#'
#' Description `r get_description("daily")`
Collaborator

I find the first sentence of the description confusing: "Daily data provide one data value to represent water conditions for the day." I'd add "per parameter" or similar. I know this is at the API level. @mikemahoney218-usgs thoughts?


We steal this from WDFN:
https://waterdata.usgs.gov/data-collection-categories//#daily-data

They also provide multiple different statistics -- this paragraph makes more sense if you change the opening to "Daily data time series provide"

If we've got suggestions, I'm happy to ping the UI team -- though I think this page is a Leah project and I don't know how much attention it'll get before she's back!

#' @param properties The properties that should be included for each feature.
#' The parameter value is a comma-separated list of property names. Available options are
#' `r schema <- check_OGC_requests(endpoint = "daily", type = "schema"); paste(names(schema$properties), collapse = ", ")`
#' @param bbox Only features that have a geometry that intersects the bounding
Collaborator

Can you help me out and enumerate which northing/easting/southing/westing goes where in the bbox order? I always forget. Or at least link out to the answer?

Collaborator

The format is a string consisting of:

- Western-most longitude
- Southern-most latitude
- Eastern-most longitude
- Northern-most latitude


xmin,ymin,xmax,ymax

I know this is specified somewhere... but I don't have the standard number to hand.

Something that might be worth flagging is that sf::st_bbox() will automatically be in the right sequence (because it uses the same standard)
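Putting that order in one place (the values are borrowed from the bbox example later in this thread, rounded; the element names match what `sf::st_bbox()` uses):

```r
# OGC API bbox order: xmin, ymin, xmax, ymax
# (west longitude, south latitude, east longitude, north latitude).
bbox <- c(xmin = -94.165, ymin = 34.985, xmax = -92.321, ymax = 36.136)

# Sanity checks: west is less than east, south is less than north.
stopifnot(bbox[["xmin"]] < bbox[["xmax"]],
          bbox[["ymin"]] < bbox[["ymax"]])
```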

Collaborator

Still rooting for a template here in the description.

Collaborator Author

I did add an example in the monitoring-location function (since hopefully more people would use bbox's in that service compared to the daily service). But I'll add some manual text to the param descriptions.

#' parameter_code = c("00060", "00065"),
#' datetime = c(start_date, end_date))
#'
construct_api_requests <- function(service,
Collaborator

What am I doing wrong here? I tried both bboxes just in case I had it wrong. Errored on both.

bbox_stuff = c(-94.16516107720525, 34.985431123844755, -92.32139202893917, 36.13639585246413)

bbox_stuff = c("-94.16516107720525", "34.985431123844755", "-92.32139202893917", "36.13639585246413")


bbox_test <- construct_api_requests(service = "daily", bbox = bbox_stuff, parameter_code = '00060')

Returning:
"Error in if (all(is.na(full_list)) & is.na(bbox)) { :
the condition has length > 1"

Seems to be around line 66 of construct_api_requests

Collaborator

Why and how did this work!? I accidentally put in monitoring_location_identifier and it told me it was an unused argument, but it was okay with monitoring_location.

site_test <- read_USGS_daily(monitoring_location = "USGS-06137570", parameter_code = '00060')


monitoring_location_id is the full parameter -- you partially matched it with the second example

We use ID instead of identifier throughout because of the geojson standard suggesting "id" as a top-level parameter. It's a bit of an infelicity though, considering we use very few abbreviations otherwise
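R's partial argument matching is easy to demo in isolation (`f` below is a stand-in, not the real `read_USGS_daily()` signature):

```r
# f() returns whatever it received for its one argument.
f <- function(monitoring_location_id = NA_character_) monitoring_location_id

# An abbreviated name partially matches the full argument name:
x <- f(monitoring_location = "USGS-06137570")

# A longer, non-matching name is rejected as an unused argument:
y <- tryCatch(f(monitoring_location_identifier = "USGS-06137570"),
              error = conditionMessage)

x
#> [1] "USGS-06137570"
```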

Collaborator Author

Whoops, sorry about that. I pushed up a change yesterday and didn't test the bbox for side effects

#' box are selected. The bounding box is provided as four or six numbers, depending
#' on whether the coordinate reference system includes a vertical axis (height or
#' depth).
#' @param properties The properties that should be included for each feature. The
Collaborator

Are properties the same as columns in this case, and features represent each....row? I find the term "property" understandable in terms of a JSON response, but maybe a little less familiar as another term for "column". Could definitely get used to it though!


Yes, exactly that! GeoJSON stashes attributes under the "properties" key, so if you're making direct API calls it makes sense that you'd filter that array down using the properties argument.

Collaborator Author

I toyed initially with the idea of calling the argument "columns", but don't want to sway too far from the actual service. How about we be more explicit in the descriptions, but still call the actual argument "properties"?

#' @param properties A vector of requested columns to be returned from the query.
#' Available options are: 

Collaborator

I like this solution. I appreciate sticking to the API service!

#' time = c("2023-01-01", "2024-01-01"))
#'
#' }
read_USGS_daily <- function(monitoring_location_id = NA_character_,
Collaborator

Another place I'm getting an error:

site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", parameter_code = '00060', statistic_id = "00002")
Function in development, use at your own risk.
GET: https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=json&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&statistic_id=00002
Error in order(return_list$time, return_list$monitoring_location_id) : 
  argument 1 is not a vector
In addition: Warning messages:
1: In min(bb[, 1L], na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In min(bb[, 2L], na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
3: In max(bb[, 3L], na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
4: In max(bb[, 4L], na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
5: Unknown or uninitialised column: `time`. 
6: Unknown or uninitialised column: `monitoring_location_id`. 

It does work if you wrap the statistic_id into a c().

site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", parameter_code = '00060', statistic_id = c("00003"))
Function in development, use at your own risk.
GET: https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=json&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&statistic_id=00003

This is also the same error you'd get if you try a non-existent statistic id. I just tried c("3") and it gave the same error. I understand why type checking isn't here yet, but something to consider.

Collaborator

Hmmm....also here. I first just put in a YYYY-MM-DD and got an error, consulted the help, then tried a single datetime, and that failed, then tried a bounded interval:

site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", parameter_code = '00060', last_modified = "2018-02-12T00:00:00Z/2024-03-18T12:31:12Z")
Function in development, use at your own risk.
GET: https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=json&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&last_modified=2018-02-12T00%3A00%3A00Z%2F2024-03-18T12%3A31%3A12Z
Error in order(return_list$time, return_list$monitoring_location_id) : 
  argument 1 is not a vector
In addition: Warning messages:
1: In min(bb[, 1L], na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
2: In min(bb[, 2L], na.rm = TRUE) :
  no non-missing arguments to min; returning Inf
3: In max(bb[, 3L], na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
4: In max(bb[, 4L], na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
5: Unknown or uninitialised column: `time`. 
6: Unknown or uninitialised column: `monitoring_location_id`. 

Collaborator Author

For the first example, you asked for a stat code that the site doesn't offer - so it's a great example of a valid request with no returned data. I'll work up a better response. It's something to think about.

The service itself gives some JSON:
{
    "type":"FeatureCollection",
    "features":[],
    "numberReturned":0,
    "links":[
        {
            "type":"application/geo+json",
            "rel":"self",
            "title":"This document as GeoJSON",
            "href":"https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=json&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&statistic_id=00002"
        },
        {
            "rel":"alternate",
            "type":"application/ld+json",
            "title":"This document as RDF (JSON-LD)",
            "href":"https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=jsonld&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&statistic_id=00002"
        },
        {
            "type":"text/html",
            "rel":"alternate",
            "title":"This document as HTML",
            "href":"https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=html&lang=en-US&limit=10000&monitoring_location_id=USGS-06137570&parameter_code=00060&statistic_id=00002"
        },
        {
            "type":"application/json",
            "title":"Daily values",
            "rel":"collection",
            "href":"https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily"
        }
    ],
    "timeStamp":"2025-05-23T14:18:51.680649Z"
}

We can pretty easily check for "numberReturned": 0. HOWEVER, historically dataRetrieval users really like their empty requests to come back with an empty data frame so they can have it in a big ol' loop and just keep binding rows together.
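A sketch of that check, with a mocked parsed response rather than the package's actual object: if `numberReturned` is zero, hand back a zero-row data frame so row-binding loops keep working.

```r
# json_content mocks the parsed GeoJSON FeatureCollection shown above.
json_content <- list(type = "FeatureCollection",
                     features = list(),
                     numberReturned = 0)

result <- if (json_content$numberReturned == 0) {
  data.frame()          # zero rows: safe to rbind in a loop
} else {
  stop("parse features here")
}

nrow(result)
#> [1] 0
```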

Side note: I'm about to convert all the readNWISdv and readNWISsite functions in the testthat tests to these functions - so there will probably be other error handling stuff like this fixed today.

Collaborator Author

Similarly, that 2nd example had no data that was last modified between 2018 and 2024. If you changed it to:

site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", 
                             parameter_code = '00060',
                             last_modified = "2024-03-18T12:31:12Z/2025-05-18T12:31:12Z")

then you'd get some results. So, number 1 priority this morning is the no data result. Give me a little bit to think about that....

Collaborator

These all work great now, with the first one returning a df of 0 rows. Thanks @ldecicco-USGS!!

#' multi_site <- read_USGS_monitoring_location(state_name = "Wisconsin")
#'
#' }
read_USGS_monitoring_location <- function(monitoring_location_id = NA_character_,
Collaborator

Ran into an issue with this call (is it the query rate limiter or...?):

sites = whatNWISsites(statecd = "NY")
sites = paste0("USGS-", sites$site_no)

blah <- read_USGS_monitoring_location(monitoring_location_id = sites, skipGeometry = TRUE)
Function in development, use at your own risk.
POST: https://api.waterdata.usgs.gov/ogcapi/v0/collections/monitoring-locations/items?f=json&lang=en-US&skipGeometry=TRUE&limit=10000
Error in `httr2::req_perform()` at dataRetrieval/R/walk_pages.R:53:3:
! HTTP 403 Forbidden.
Run `rlang::last_trace()` to see where the error occurred.
> rlang::last_trace()
<error/httr2_http_403>
Error in `httr2::req_perform()` at dataRetrieval/R/walk_pages.R:53:3:
! HTTP 403 Forbidden.
---
Backtrace:
    ▆
 1. └─dataRetrieval::read_USGS_monitoring_location(...)
 2.   └─dataRetrieval:::walk_pages(site_req) at dataRetrieval/R/read_USGS_monitoring_location.R:132:3
 3.     └─dataRetrieval:::walk_pages_recursive(req = req, page = 1, contents = list()) at dataRetrieval/R/walk_pages.R:34:3
 4.       └─httr2::req_perform(req) at dataRetrieval/R/walk_pages.R:53:3
Run rlang::last_trace(drop = FALSE) to see 3 hidden frames.

Just interesting that if I use sites[1:100] as an input, it works fine, even with a POST. sites is 88,922 sites long, so maybe I'm hitting the limit of what the database accepts.


Can I ask why you're querying for a specific 88,922 sites? I have a hard time imagining the use case for that one 😅

Collaborator

@ehinman ehinman May 22, 2025

LOL. I didn't realize you could query by state_name. Shame on me. Probably unlikely, but I probably wouldn't be the first, either. I was mimicking a workflow I've seen to grab all the site numbers within a state to then pull peaks data, and I wanted to see how it would handle a long set of sites. Yep, 89,000 is excessive.

Collaborator

But another good reason for req_error() handling!


Agree -- but our suggestion to this user is going to be "consider splitting your request into multiple pieces"
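A sketch of that suggestion, chunking a long ID vector with base R (the chunk size of 100 is illustrative, not a documented API limit):

```r
# 250 fake site IDs, split into chunks of at most 100.
sites  <- sprintf("USGS-%08d", seq_len(250))
chunks <- split(sites, ceiling(seq_along(sites) / 100))

lengths(chunks)
#>   1   2   3
#> 100 100  50

# Hypothetical usage: one request per chunk, then bind the results.
# results <- lapply(chunks, function(ids)
#   read_USGS_monitoring_location(monitoring_location_id = ids))
# all_sites <- do.call(rbind, results)
```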

Collaborator

Message looks good upon retrying it. Thanks!

read_USGS_monitoring_location <- function(monitoring_location_id = NA_character_,
agency_code = NA_character_,
agency_name = NA_character_,
monitoring_location_number = NA_character_,
Collaborator

Interesting that there's id and number. My guess is that hardcore legacy users will want to stick with number.


Data furnished from other agencies can have the same site number as USGS sites. Monitoring location IDs are guaranteed to be unique across the entire network.

Collaborator

Makes sense.

@ehinman
Collaborator

ehinman commented May 23, 2025

AH! Yeah, that sounds cool!

I'm not sure if I think this is more confusing to have these "shadow" functions. The pros are that it keeps some "normalcy" for what was, but the cons are that it keeps people from fully adopting the new function. Feel free to push back, I just would rather limit the number of available options.

@ldecicco-USGS
Collaborator Author

> AH! Yeah, that sounds cool!
>
> I'm not sure if I think this is more confusing to have these "shadow" functions. The pros are that it keeps some "normalcy" for what was, but the cons are that it keeps people from fully adopting the new function. Feel free to push back, I just would rather limit the number of available options.

Yup, I think after some back-and-forth scattered above I've talked myself out of it too.

@jzemmels
Collaborator

jzemmels commented May 23, 2025

> AH! Yeah, that sounds cool!
>
> I'm not sure if I think this is more confusing to have these "shadow" functions. The pros are that it keeps some "normalcy" for what was, but the cons are that it keeps people from fully adopting the new function. Feel free to push back, I just would rather limit the number of available options.

I consider it as being the new function, just with a name that evokes the old function and may be easier for users to remember. I'll admit that my desire to add this is based on a frustration I sometimes have with tidyverse and other packages where they change a function's name based on what they consider best. It took me a long time to remember pivot_wider and pivot_longer when I had originally learned gather and spread -- and I'm sure those functions frustrated the melt and cast crowd.

I'm thinking that long-time dataRetrieval users will first remember dv when they're trying to find something before they remember that it's now called daily -- especially because "daily" was actually a different endpoint in the legacy API. Plus it's "cheap" to add a function like this since it's literally just a carbon copy of another. I won't be upset if I'm overruled on this, but I personally consider it to be a low-hanging, nice-to-have for onboarding old users to the new functions.

@ldecicco-USGS
Collaborator Author

OK folks! I know it's Friday afternoon before a 3 day weekend, so before I forget:

I added a little bit of error handling here:
https://github.com/ldecicco-USGS/dataRetrieval/blob/dv_service/R/construct_api_requests.R#255
We can obviously expand what gets checked, but let's be careful with what we add. I'm sure there are plenty of errors that could mean several things. I don't want to lead users astray if the top Google hit of the server error message tells them exactly what they need to know.

I switched to httr2::req_perform_iterative, so the custom recursive function is gone. It doesn't reduce THAT much code, but it keeps more things in the httr2 world. Note, there's usually a function httr2::resps_data that a lot of the httr2 examples use, but I found several cases where it had a really hard time matching up column types from one page to the next, so we still need a custom function for that:
https://github.com/ldecicco-USGS/dataRetrieval/blob/dv_service/R/construct_api_requests.R#L74

We also return an empty data frame now if a user makes a valid request for no data (like, it's a site with data, but not during the requested time period).

So give it another good test next week.

@ldecicco-USGS
Collaborator Author

Unless I hear objections, I'll plan to merge this tomorrow morning (5/28). That will make it easier to get a few more folks testing the new functions, and we'll be able to have small digestible PRs going forward (updates to the vignette, tests for some of the internal functions, etc...)

#' box are selected. The bounding box is provided as four or six numbers, depending
#' on whether the coordinate reference system includes a vertical axis (height or
#' depth). Coordinates are assumed to be in crs 4326.
#' @param limit The optional limit parameter limits the number of items that are
Collaborator

@ehinman ehinman May 27, 2025

I'm confused about what this means. I tried this:

bbox <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5), parameter_code = c("00060", "00065"), limit = 100)

And it was super fast, and returned exactly 2000 rows, which seemed weird, but I'll roll with it.

I then tried this (no limit):

bbox <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5), parameter_code = c("00060", "00065"))

And it cannot seem to complete. I tried running it twice. Here's the error I received:
[Screenshot: console error output]

Collaborator Author

I think we'll want to train users to use the read_USGS_ts_meta function to find the info they want first.

bbox_vals = c(-83, 36.5, -82, 37.5)

multi_site <- read_USGS_ts_meta(bbox = bbox_vals, 
                                parameter_code = c("00060", "00065"))

library(dplyr)
ids_I_want <- multi_site |> 
  filter(end >= as.Date("2025-05-01"),
         computation_identifier != "Instantaneous")
nrow(ids_I_want)
168

Note, that is still a TON of data (one of those sites claims it goes back to 1791?); it's going to exceed the amount we can ask for in one go.

But say you do rejigger the IDs and find exactly the ones you're interested in; then you could do this and avoid using the bounding box on the daily service.

multi_site_dv <- read_USGS_daily(time_series_id = ids_I_want$time_series_metadata_id[1:10])

Collaborator

Good point! I like this solution.

Collaborator

I guess I'm still confused on what limit means, though. My limit was 100 and I got 2000 rows returned. What is the math on that?

Collaborator Author

Here's what I'm guessing: it seems like we can get 20 pages at most. At least in my little bit of testing, I'm seeing 20 pages coming back without a warning that you are missing data. I'll try to confirm, but here's what I see:

bbox_lim_100 <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5), 
                          parameter_code = c("00060", "00065"), 
                          limit = 100)
nrow(bbox_lim_100)
2000

bbox_lim_500 <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5), 
                          parameter_code = c("00060", "00065"), 
                          limit = 500)
nrow(bbox_lim_500)
10000

bbox_lim_1000 <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5), 
                          parameter_code = c("00060", "00065"), 
                          limit = 1000)
nrow(bbox_lim_1000)
20000

bbox_lim_2000 <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5), 
                          parameter_code = c("00060", "00065"), 
                          limit = 2000)
nrow(bbox_lim_2000)
40000

Collaborator Author

HA! That's on me! I didn't realize there was a limit on the req_perform_iterative function:

req_perform_iterative(
  req,
  next_req,
  path = NULL,
  max_reqs = 20, # <- WHOOOPS!
  on_error = c("stop", "return"),
  progress = TRUE
)
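That default makes the earlier row counts line up exactly: each page carries `limit` rows and iteration stops after `max_reqs` pages (in httr2, passing `max_reqs = Inf` removes the cap). The arithmetic:

```r
# Each observed total is limit * max_reqs (default max_reqs = 20):
limits <- c(100, 500, 1000, 2000)
limits * 20
#> [1]  2000 10000 20000 40000
```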


return_list <- rejigger_cols(return_list, properties, service)

return(return_list)
Collaborator

Observation: I tried making a big-ish call (again, maybe I'm being unreasonable here, but that's not THAT big of a bbox)

bbox <- read_USGS_daily(bbox = c(-83, 36.5, -81, 38.5))

I loved the progress bar, but it kept upping its time every iteration...
[Screenshots: progress bar ETA increasing between iterations]

To the point where I decided maybe I was asking too much of it and I stopped it. It returned 30,000 rows of data that it had grabbed before I stopped it. That was cool!

However, when the function errors while it is iterating, it returns nothing. Is it possible to have it return the rows it does grab, and some identifier that would allow a user to keep going if they wanted to grab the rest?

Collaborator

^^ This is literally one of the biggest problems we try to solve with targets in the fetch portion...if dataRetrieval could do some of that for us, it would be really neat!

Collaborator Author

It doesn't seem like we can get an accurate progress bar because it's SUPER expensive for the database to pre-calculate how many pages it's going to return. I was going to mess with the display information eventually, but since it will never be terribly accurate, I'm having a hard time thinking of a reason to keep the progress bar in. I haven't messed with enough of the httr2 options yet to know what to fiddle with to make it a useful set of info to display.

Collaborator Author

What I've been assuming those "ETA"s are: they seem to be based on how long the previous call took, so it's an ETA for the current iteration (not the overall request).

Collaborator

Ahhh. Hmm....are pages organized in a consistent way? (a question for @mikemahoney218-usgs) Like say someone makes a big call that eventually times out after X pages have been completed...is there any way they can use the information from the completed pages to simply "continue" on downloading in a new call?

all_properties <- names(schema$properties)

match.arg(properties, choices = c(all_properties, NA_character_),
several.ok = TRUE)
Collaborator

This is minor, so feel free to disregard unless it becomes a problem. But, nonetheless something I noticed while crashing around in here.

site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", parameter_code = '00060', properties = c("dog"))

I love the message that tells you what you can place in properties. However, one of the options is displayed as "NA". So I decided to try "NA", like this:

site_test <- read_USGS_daily(monitoring_location_id = "USGS-06137570", parameter_code = '00060', properties = c("NA"))

...and it gave me a separate error:

Error in match.arg(properties, choices = c(all_properties, NA_character_),  : 
  'arg' must be NULL or a character vector

At this point, someone should know to replace "NA" with NA_character_....and really just shouldn't mess with it at all if they want everything. I guess my point is that maybe "NA" shouldn't be in the list of allowable values, since it needs to be that specific NA_character_.

#' "USGS-01645000"),
#' parameter_code = c("00060", "00010"),
#' limit = 500,
#' time = c("2023-01-01", "2024-01-01"))
Collaborator

Ah, so here's an example (sorry, I didn't look at this first). I tried this example with and without the limit, and it returned the same number of rows, which is what I would've expected with the bbox query. Strange behavior that the example above works with pages of 100 but not pages of 10,000?

@ldecicco-USGS ldecicco-USGS merged commit 4f0a42e into DOI-USGS:develop May 28, 2025
1 check passed
@ldecicco-USGS ldecicco-USGS deleted the dv_service branch June 2, 2025 21:28
