Description
- I have checked the existing issues to avoid duplicates
- I have redacted any info hashes and content metadata from any logs or screenshots attached to this issue
Describe the bug
The DHT crawler filters newly discovered nodes before forwarding them to the rest of the crawler pipeline. Unfortunately, this filtering is based on the table.addrs address map, which does not contain all nodes recently seen by the crawler.
In my testing, this results in the crawler attempting to scrape the same nodes 50+ times within a few minutes. In the long run, only about 70% of the nodes that the discovered_nodes component forwards to the rest of the pipeline are actually new.
To Reproduce
Steps to reproduce the behavior:
N/A
Expected behavior
The filtering should be more sophisticated. As a very naive replacement, I maintained a set of all discovered nodes and skipped any node that had already been seen. This resulted in a ~80% increase in discovered torrents per unit of time.
This implementation is far from ideal. First, the set grows without bound, eventually causing OOMs in my setup. Second, it does not discriminate based on where a node was previously forwarded - it might be interesting to sample infohashes from a node that was previously only pinged. Finally, nodes should not be excluded forever just because they were visited once; a proper implementation should have a timeout after which a node becomes eligible for another visit.
Environment Information (Required)
- Bitmagnet version: latest git
- OS and version: Linux 6.14.10-arch1-1
- Please specify any config values for which you have overridden the defaults: the longer-running tests were run with dht_crawler.scaling_factor = 20, but I observed similar results with other scaling factors during shorter runs too.