chore: improving efficiency of similar projects analyzer #1170
Summary
The goal is to improve the efficiency of the SimilarProjectsAnalyzer, which currently downloads the source code tarball for every package of every maintainer it finds. The solution is to use the tarball/wheel structure shown on the package's inspector.pypi.io page, making web requests to extract the structure instead of downloading the package.
Description of changes
This PR changes how PyPI inspector links are handled by creating a separate PyPIInspectorAsset object, which contains information about the PyPI inspector URLs and can extract the project structure from a package URL. The WheelAbsenceAnalyzer is then modified to use this object, which simplifies it, and the SimilarProjectsAnalyzer uses it to analyze the package structure.
The
SimilarProjectsAnalyzer normalizes the structure by doing the following:
- Removing the <package_name>-<version> prefix.
- Removing <package_name> from the top-level folder, resulting in a folder structure that does not contain the package name at the top level.
- Removing setup.py from tarballs.
This makes wheels and tarballs comparable when looking at the package structure. A unit test demonstrates this. The
SimilarProjectsAnalyzer then computes a hash of each normalized folder structure and compares it against the other projects published by the maintainers of the analyzed package. If at least one is similar, the analyzer fails, but it continues looping to collect all similar projects.
A known complication is that PyPI uses the Fastly CDN, which can return a JavaScript challenge response. Since PyPI inspector uses URLs rerouted from PyPI, these JavaScript challenges are also received when making programmatic requests in Python to a PyPI inspector URL. This does not always happen, but it occurs frequently. To accommodate this, the analyzer does not raise a HeuristicAnalyzerValueError and instead returns SKIP results when it is unable to obtain package information.
Checklist
verified
label should appear next to all of your commits on GitHub.
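For reviewers, the normalization, hashing, and comparison steps described under "Description of changes" can be illustrated with a minimal Python sketch. This is not the code in this PR. The names `Result`, `normalize_structure`, `structure_hash`, and `find_similar` are hypothetical, the use of SHA-256 is an assumption, and the sketch assumes the file listing has already been obtained (or is `None` when the inspector request failed, for example because of a JavaScript challenge).

```python
import hashlib
from enum import Enum
from pathlib import PurePosixPath


class Result(Enum):
    """Hypothetical stand-in for the analyzer's result type."""

    PASS = "PASS"
    FAIL = "FAIL"
    SKIP = "SKIP"


def normalize_structure(paths, package_name, version):
    """Return a sorted list of file paths with package-specific parts removed."""
    prefix = f"{package_name}-{version}"
    normalized = set()
    for raw in paths:
        parts = list(PurePosixPath(raw).parts)
        # Drop the "<package_name>-<version>" prefix folder found in tarballs.
        if parts and parts[0] == prefix:
            parts = parts[1:]
        # Drop the top-level folder named after the package.
        if parts and parts[0] == package_name:
            parts = parts[1:]
        # Drop setup.py, which tarballs carry but wheels do not.
        if parts == ["setup.py"]:
            continue
        if parts:
            normalized.add("/".join(parts))
    return sorted(normalized)


def structure_hash(structure):
    """Hash a normalized structure (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256("\n".join(structure).encode("utf-8")).hexdigest()


def find_similar(target, others):
    """Compare the target structure against the maintainers' other projects.

    `target` is a normalized structure, or None if it could not be fetched.
    `others` maps project names to normalized structures (or None).
    Returns (Result, list of similar project names).
    """
    if target is None:
        # No package information available: skip instead of raising.
        return Result.SKIP, []
    target_hash = structure_hash(target)
    # Keep looping after the first match so every similar project is collected.
    similar = [
        name
        for name, structure in others.items()
        if structure is not None and structure_hash(structure) == target_hash
    ]
    return (Result.FAIL if similar else Result.PASS), similar


# Example: a tarball listing and a wheel listing of the same (made-up) package.
tarball = ["pkg-1.0/setup.py", "pkg-1.0/pkg/__init__.py", "pkg-1.0/pkg/core.py"]
wheel = ["pkg/__init__.py", "pkg/core.py"]
same = normalize_structure(tarball, "pkg", "1.0") == normalize_structure(wheel, "pkg", "1.0")
```

In this sketch a project whose structure could not be fetched is simply ignored during comparison, and a missing target structure yields SKIP. Real wheels and tarballs also contain metadata folders, which this simplified normalization does not handle.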