
Discovered nodes filtering isn't strict enough #433

@abitofevrything

Description

  • I have checked the existing issues to avoid duplicates
  • I have redacted any info hashes and content metadata from any logs or screenshots attached to this issue

Describe the bug

The DHT crawler filters newly discovered nodes before forwarding them to the rest of the crawler pipeline. Unfortunately, this filtering is based on the table.addrs address map, which doesn't contain every node the crawler has seen recently.

In my testing, this causes the crawler to attempt to scrape the same nodes 50+ times within a few minutes. In the long run, only 70% of the nodes that the discovered_nodes component forwards to the rest of the pipeline are actually new nodes.

To Reproduce

Steps to reproduce the behavior:

N/A

Expected behavior

The filtering should be stricter. I implemented a very naive replacement that just maintains a set of all discovered nodes and skips any node that has already been seen. This resulted in a ~80% increase in discovered torrents per unit of time.

This implementation is far from ideal. First, its memory consumption grows without bound, which eventually caused OOMs in my setup. Second, it doesn't discriminate based on where the node was eventually forwarded to - it might be interesting to sample infohashes from a node that was previously just pinged. Finally, nodes should not be excluded forever just because they were visited once; an actual implementation should have a timeout after which a node can be visited again.
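To make the three points above concrete, here is a minimal sketch of what such a filter could look like. It is not Bitmagnet's code: `SeenFilter`, `Admit`, `Purpose`, `manualClock` and `defaultTTL` are all hypothetical names, the 10 minute timeout is an arbitrary placeholder, and nodes are keyed by a plain address string for simplicity. It assumes a single fixed timeout, so entries expire in the same order they were inserted and a simple queue is enough to bound memory.

```go
package main

import (
	"sync"
	"time"
)

// Purpose is a hypothetical tag for the pipeline stage a node was forwarded to.
type Purpose uint8

const (
	PurposePing Purpose = iota
	PurposeSampleInfohashes
)

// defaultTTL is an arbitrary placeholder, not a value taken from Bitmagnet.
const defaultTTL = 10 * time.Minute

type seenKey struct {
	addr    string
	purpose Purpose
}

type seenEntry struct {
	key     seenKey
	expires time.Time
}

// SeenFilter remembers (address, purpose) pairs for a fixed TTL and
// never holds more than maxEntries of them.
type SeenFilter struct {
	mu         sync.Mutex
	ttl        time.Duration
	maxEntries int
	now        func() time.Time
	seen       map[seenKey]struct{}
	queue      []seenEntry // insertion order equals expiry order (fixed TTL)
}

// NewSeenFilter builds a filter; pass time.Now as the clock in production.
func NewSeenFilter(ttl time.Duration, maxEntries int, now func() time.Time) *SeenFilter {
	if maxEntries < 1 {
		maxEntries = 1
	}
	return &SeenFilter{ttl: ttl, maxEntries: maxEntries, now: now, seen: make(map[seenKey]struct{})}
}

// Admit reports whether the node should be forwarded for the given purpose.
// It returns false if the same pair was admitted less than ttl ago.
func (f *SeenFilter) Admit(addr string, p Purpose) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	now := f.now()
	// Drop expired entries from the front of the queue.
	for len(f.queue) > 0 && !f.queue[0].expires.After(now) {
		f.popOldest()
	}
	k := seenKey{addr: addr, purpose: p}
	if _, ok := f.seen[k]; ok {
		return false
	}
	// At capacity: evict the oldest entries early rather than grow.
	for len(f.queue) >= f.maxEntries {
		f.popOldest()
	}
	f.seen[k] = struct{}{}
	f.queue = append(f.queue, seenEntry{key: k, expires: now.Add(f.ttl)})
	return true
}

func (f *SeenFilter) popOldest() {
	delete(f.seen, f.queue[0].key)
	f.queue = f.queue[1:]
}

// manualClock is a controllable clock for trying the filter out in tests.
type manualClock struct{ t time.Time }

func (c *manualClock) Now() time.Time          { return c.t }
func (c *manualClock) Advance(d time.Duration) { c.t = c.t.Add(d) }
```

With something like `f := NewSeenFilter(defaultTTL, 1_000_000, time.Now)`, the crawler would call `f.Admit(addr, PurposePing)` before forwarding a node and skip it on `false`. The trade-off of the capacity limit is that under heavy load old entries are evicted before their timeout, so some duplicates can still slip through; that seems preferable to an OOM, but the right limit would need measuring.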

Environment Information (Required)

  • Bitmagnet version: latest git
  • OS and version: Linux 6.14.10-arch1-1
  • Please specify any config values for which you have overridden the defaults: the longer running tests were run with dht_crawler.scaling_factor = 20, but I observed similar results with other scaling factors during shorter runs too.

Metadata

Labels: bug
