art1f1c3R (Member) commented Sep 5, 2025

Summary

The goal is to improve the efficiency of the SimilarProjectsAnalyzer, which currently downloads the source code tarball for every package of every maintainer it finds. The solution is to use the structure of the tarball/wheel shown on the package's inspector.pypi.io page, making web requests to extract the structure instead of downloading the package.

Description of changes

This PR changes how PyPI inspector links are handled by creating a separate PyPIInspectorAsset object, which contains information about the PyPI inspector URLs and can extract the project structure from a package URL. The WheelAbsenceAnalyzer is modified to use this object, which simplifies it, and the SimilarProjectsAnalyzer uses it to analyze the package structure.

The SimilarProjectsAnalyzer normalizes the structure by doing the following:

  • Only considering Python files.
  • Removing the <package_name>-<version> prefix.
  • Removing the <package_name> from the top-level folder, resulting in a folder structure that does not contain the package name at the top level.
  • Removing setup.py from tarballs.

This makes wheels and tarballs comparable when looking at the package structure, and a unit test demonstrates this. The SimilarProjectsAnalyzer then hashes these normalized folder structures and compares the hashes against other projects made by the maintainers of the analyzed package. If at least one is similar, the analyzer fails, but it continues to loop so that all similar projects are collected.
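The normalization and hashing described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the function names `normalize_structure` and `structure_hash`, the use of SHA-256, and the assumption that `setup.py` sits directly under the `<package_name>-<version>` folder are all assumptions made for the example.

```python
import hashlib
from pathlib import PurePosixPath


def normalize_structure(paths, package_name, version, is_tarball):
    """Return a sorted list of normalized Python file paths (hypothetical helper)."""
    prefix = f"{package_name}-{version}"
    normalized = set()
    for raw in paths:
        parts = list(PurePosixPath(raw).parts)
        # Only Python files are considered.
        if not parts or not parts[-1].endswith(".py"):
            continue
        # Drop the <package_name>-<version> prefix found in tarballs.
        if parts[0] == prefix:
            parts = parts[1:]
        # Drop the package name from the top-level folder.
        if len(parts) > 1 and parts[0] == package_name:
            parts = parts[1:]
        # Assumption: setup.py only needs removing at the top level of a tarball.
        if is_tarball and parts == ["setup.py"]:
            continue
        if parts:
            normalized.add("/".join(parts))
    return sorted(normalized)


def structure_hash(normalized_paths):
    """Hash the normalized structure; SHA-256 is an assumption for this sketch."""
    joined = "\n".join(normalized_paths)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Made-up file listings for a hypothetical package "demo" at version 1.0.
tarball_paths = [
    "demo-1.0/setup.py",
    "demo-1.0/README.md",
    "demo-1.0/demo/__init__.py",
    "demo-1.0/demo/core/util.py",
]
wheel_paths = [
    "demo/__init__.py",
    "demo/core/util.py",
    "demo-1.0.dist-info/METADATA",
]

tarball_structure = normalize_structure(tarball_paths, "demo", "1.0", is_tarball=True)
wheel_structure = normalize_structure(wheel_paths, "demo", "1.0", is_tarball=False)
```

With these made-up listings, both the tarball and the wheel reduce to the same two relative paths, so their hashes match. Layouts that this sketch does not handle (for example a `src/` directory in the tarball) would need extra rules.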

A known complication is that PyPI uses the Fastly CDN, which can return a JavaScript challenge response. Because PyPI inspector uses URLs rerouted from PyPI, these JavaScript challenges are received when making programmatic requests from Python to a PyPI inspector URL. This does not always happen, but it is frequent. To accommodate this, the analyzer is written so that it does not raise HeuristicAnalyzerValueErrors and instead returns SKIP results when it cannot obtain package information.
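The fail-but-keep-collecting and SKIP behaviour can be illustrated with a small sketch. Everything here is hypothetical: the `Result` enum, the `compare_structures` function, and the convention that a missing hash is represented by `None` are stand-ins for the example and are not the analyzer's actual types. In particular, returning SKIP when no related project could be fetched is an assumption.

```python
from __future__ import annotations

from enum import Enum


class Result(Enum):
    """Hypothetical stand-in for the analyzer's result values."""

    PASS = "PASS"
    FAIL = "FAIL"
    SKIP = "SKIP"


def compare_structures(
    target_hash: str | None,
    other_hashes: dict[str, str | None],
) -> tuple[Result, list[str]]:
    """Compare a package's structure hash against its maintainers' other projects.

    A value of None means the structure could not be obtained, for example
    because the request was answered with a JavaScript challenge.
    """
    # No structure for the analyzed package: skip instead of raising.
    if target_hash is None:
        return Result.SKIP, []

    # Ignore projects whose structure could not be fetched.
    comparable = {name: h for name, h in other_hashes.items() if h is not None}
    if not comparable:
        # Assumption: nothing to compare against is treated as SKIP.
        return Result.SKIP, []

    # Keep looping after the first match so every similar project is collected.
    similar = [name for name, h in comparable.items() if h == target_hash]
    return (Result.FAIL if similar else Result.PASS), similar
```

For example, `compare_structures("abc", {"a": "abc", "b": None, "c": "abc"})` reports a failure and lists both `a` and `c`, while a target hash of `None` yields a SKIP without raising.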

Checklist

  • I have reviewed the contribution guide.
  • My PR title and commits follow the Conventional Commits convention.
  • My commits include the "Signed-off-by" line.
  • I have signed my commits following the instructions provided by GitHub. Note that we run GitHub's commit verification tool to check the commit signatures. A green verified label should appear next to all of your commits on GitHub.
  • I have updated the relevant documentation, if applicable.
  • I have tested my changes and verified they work as expected.

@oracle-contributor-agreement oracle-contributor-agreement bot added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) Sep 5, 2025
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/similar-projects-efficiencies branch from ed4d2de to b7c115f Compare September 5, 2025 02:10
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/similar-projects-efficiencies branch 2 times, most recently from 8336056 to b8347ad Compare October 3, 2025 04:53
… pypi registry code

Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/similar-projects-efficiencies branch from d096a31 to c7b95e6 Compare October 10, 2025 05:22
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
@art1f1c3R art1f1c3R marked this pull request as ready for review October 10, 2025 06:26
@art1f1c3R art1f1c3R requested a review from behnazh-w as a code owner October 10, 2025 06:26