Enable overriding undocumented telemetry identifier with PIP_TELEMETRY_USER_AGENT_ID#13560
Enable overriding undocumented telemetry identifier with PIP_TELEMETRY_USER_AGENT_ID#13560cosmicexplorer wants to merge 4 commits intopypa:mainfrom
Conversation
|
It is not accurate that commit was pushed directly to master, it was submitted as a PR and reviewed #9987 (I'm not a pip committer and couldn't have pushed to master if I wanted) |
|
I'm extremely sorry. I will delete that immediately. I pinged you because I found it very confusing given the other work I've seen from you. |
|
I will review those PR comments now. I'm incredibly sorry once again and thanks so much for your very patient and thoughtful response. |
|
Ok, I have deleted the revert and retained the changes of #9987. Thanks so much @alex for linking me there. This now just proposes a separate environment variable to override the I can't reproduce the test failure and I am confident my changes continue to produce valid JSON for |
|
@cosmicexplorer FYI I fixed a typo in your post
I don't really understand why the While I see there being other motivations for wanting to override the telemetry part of the user agent string, I'm not sure I quite follow this reasoning. I wonder if it makes sense to appropriate #13038 or make a new issue where we can discuss the reasons and design for overriding the telemetry, rather than the technical issues of the PR. Overall, I am currently neither for or against this feature, but I will leave a quick review on an immediate technical issue I see with your code. |
| def user_agent_id() -> str: | ||
| return Telemetry.calculate_user_agent_id(os.environ) | ||
|
|
||
| CLOBBER_USER_AGENT_ENV_VAR: ClassVar[str] = "PIP_TELEMETRY_USER_AGENT_ID" |
There was a problem hiding this comment.
Whatever the option is names or becomes please make this a regular pip option, e.g. --telemetry-user-agent-id, it will therefore be accessible via command line, environmental variable (as PIP_TELEMETRY_USER_AGENT_ID), and configuration files.
You can see recent PRs #13534 and #13520 for adding new pip options.
| We now introduce `PIP_TELEMETRY_USER_AGENT_ID`, which will completely overwrite the | ||
| string transmitted to remote hosts that pip communicates it. To maintain backwards | ||
| compatibility, it is disabled by default, but can be set to the empty string or any | ||
| other value. |
There was a problem hiding this comment.
I am unconvinced by this design over having an option like "disable telemetry", what is the value in users being able to replace the user string with a static arbitrary string?
| FIXME: This string is currently propagated as a header into every single HTTP | ||
| request pip makes. It should really be subject to a formal PEP process | ||
| in order to ensure user consent around telemetry is respected. |
There was a problem hiding this comment.
Please propose this in https://discuss.python.org/c/packaging/14, not in a pip doc string. It's not clear to me that everyone will agree with this.
| Environment markers can therefore be viewed as a language for codebases to | ||
| communicate their build process and requirements | ||
| (https://packaging.python.org/en/latest/flow/#the-packaging-flow) to a generic | ||
| resolver. Spack specs | ||
| (https://spack.readthedocs.io/en/latest/spec_syntax.html#sec-specs) | ||
| are very similar in spirit, and arose precisely to codify the recursive | ||
| and conditional relationships across multi-language software stacks. | ||
|
|
||
| Extending this concept, we could consider the *user agent id* as a language for | ||
| resolvers to communicate their requirements to external services and | ||
| execution environments. For example, clients for the GNU Make Jobserver protocol use | ||
| an environment variable to indicate how the jobserver should communicate to them, | ||
| and to indicate the maximum bandwidth they can tolerate for parallel execution: | ||
| (https://www.gnu.org/software/make/manual/html_node/Job-Slots.html). | ||
|
|
||
| The pants build tool specifically highlights the risk of exposing proprietary | ||
| information through thoughtless telemetry (https://www.pantsbuild.org/stable/docs/using-pants/anonymous-telemetry). | ||
| As a result, not only do they explicitly specify the information being recorded | ||
| (https://www.pantsbuild.org/stable/docs/using-pants/anonymous-telemetry#what-data-is-sent), | ||
| but they additionally incorporate anonymity as an explicit design goal | ||
| (https://www.pantsbuild.org/stable/docs/using-pants/anonymous-telemetry#how-we-ensure-anonymity). |
There was a problem hiding this comment.
This seems all quite opinionated to be in a docstring, and doesn't recognize a malicious index could extract most of the information from a user by simply observing what wheels are downloaded.
| Negotiating standards around useful telemetry data for PyPI began here | ||
| (https://github.com/pypa/pip/issues/5499), but never became a full PEP. This commit | ||
| (https://github.com/pypa/pip/commit/f787788a65cf7a8af8b7ef9dc13c0681d94fff5f) added | ||
| the output of an arbitrary subprocess execution into the string pip attaches to |
There was a problem hiding this comment.
It's not an "arbitrary" subprocess execution, it's an explicitly named subprocess execution.
| # because some CI systems mimic a tty (e.g. Travis CI). Thus that | ||
| # method doesn't provide definitive information in either direction. | ||
| return any(name in os.environ for name in CI_ENVIRONMENT_VARIABLES) | ||
| _rustc_output_regex: ClassVar[re.Pattern[str]] = re.compile(r"^rustc ([^\s]+)") |
There was a problem hiding this comment.
If there is a concern about what the rustc execution could be sending to the index then I am willing to volunteer reviewing a separate PR that adds more strict validation (presumably based on rustc documentation).
But as best as I can tell this is just a regex implementation of the prior code:
if rustc_output.startswith(b"rustc "):
# The format of `rustc --version` is:
# `b'rustc 1.52.1 (9bc8c42bb 2021-05-09)\n'`
# We extract just the middle (1.52.1) part
data["rustc_version"] = rustc_output.split(b" ")[1].decode()IMO the prior code had multiple advantages:
- It documents it's goal
- I don't have to understand a regex to understand the code
- It didn't require decoding the whole string (which if we don't trust the string seems better?)
I would at least want the documented goal of this back.
Fixes #13038.
The commit f787788 by @alex performed a
PATHtraversal and subsequent process execution (the output ofrustc --version) in order to add some more information to theUser-Agentrequest header that pip provides to its remote HTTP requests.f787788 makes reference to information valuable to package maintainers, which appears to be propagated to PyPI's BigQuery output: #9987 (comment) (thanks so much to Alex for explaining this to me -- I'm not sure why the commit message didn't have a link to his PR discussion).
I had previously proposed removing the
rustc --versioncall, but that has been removed and this PR now just enables overriding the telemetry identifier, as in title.A discussion of standardized telemetry data began to form in #5499, but failed to graduate into a full PEP. Given the very high quality of other Python packaging standards, I would like to propose that this one also be made into a full-on standard, probably somewhere around the Simple Repository API page.
This PR does two things:
PIP_TELEMETRY_USER_AGENT_ID, an environment variable which completely overrides the value used to form pip'sUser-Agentheader.