Description
- I have checked the existing issues to avoid duplicates
- I have redacted any info hashes and content metadata from any logs or screenshots attached to this issue
Describe the bug
The DHT crawler filters newly discovered nodes before forwarding them to the rest of the crawler pipeline. Unfortunately, this filtering is based on the table.addrs address map, which does not contain all nodes recently seen by the crawler.
In my testing, this results in the crawler attempting to scrape the same nodes 50+ times within a few minutes. In the long run, only about 70% of the nodes that the discovered_nodes component forwards to the rest of the pipeline are actually new.
To Reproduce
Steps to reproduce the behavior:
N/A
Expected behavior
The filtering should be more sophisticated. As a very naive replacement, I maintained a set of all discovered nodes and skipped any node that had already been seen. This resulted in a ~80% increase in discovered torrents per unit of time.
This implementation is far from ideal. First, the set grows without bound, eventually causing OOMs in my setup. Second, it does not discriminate based on where a node was previously forwarded - it might be interesting to sample infohashes from a node that was previously only pinged. Finally, nodes should not be excluded forever just because they were visited once; a proper implementation should have a timeout after which a node becomes eligible for another visit.
Environment Information (Required)
- Bitmagnet version: latest git
- OS and version: Linux 6.14.10-arch1-1
- Please specify any config values for which you have overridden the defaults: the longer-running tests were run with dht_crawler.scaling_factor = 20, but I observed similar results with other scaling factors during shorter runs too.