@malteos malteos commented Sep 19, 2025

The PR adds two new commands to the CLI:

  • filter_cdx: Filter CDX files against a given URL or SURT whitelist.
  • warc_by_cdx: Fetch WARC files like warc, but instead of filtering on a single URL, this command filters based on a given CDX file.

Both commands support reading and writing to local or remote paths (like S3) and gzip compression. To make the code more readable, parts of cli.py were moved to utils.py.

Example usage

Filter CDX files from S3 based on URL or SURT whitelist (input and output paths can be local or remote, e.g., S3):

# - one CDX file takes ~200 seconds with a whitelist of 1M SURTs
# - all 300 CDX files from CC-MAIN-2024-30 take ~1.4 hrs with the EOTA2024 list on a c5n.xlarge instance
cdxt -v filter_cdx \
    s3://commoncrawl/cc-index/collections \
    s3://commoncrawl-dev/eot-archive/malte-test/ccf-gov-federal-web-graph-2024-jun-jul-aug.txt \
    s3://commoncrawl-dev/eot-archive/malte-test/filtered-cdxs --filter-type url \
    --input-glob "/CC-MAIN-2024-30/indexes/*.gz" --overwrite
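To make the SURT filtering idea concrete, here is a minimal sketch of prefix matching against a whitelist. It is not the code from cdx_toolkit/filter_cdx; the helpers `simple_surt` and `matches_whitelist` are hypothetical names, the SURT conversion is deliberately simplified (reversed host labels plus path, no query handling or full canonicalization), and the CDX lines are made-up examples.

```python
from urllib.parse import urlsplit


def simple_surt(url: str) -> str:
    """Simplified SURT key: reversed host labels, then ')' and the path.

    Assumption: real SURT canonicalization does more (query strings,
    ports, case folding of the path); this sketch skips all of that.
    """
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    reversed_host = ",".join(reversed(host.split(".")))
    return f"{reversed_host}){parts.path or '/'}"


def matches_whitelist(surt_key: str, prefixes) -> bool:
    """Return True if the SURT key starts with any whitelisted prefix."""
    return any(surt_key.startswith(prefix) for prefix in prefixes)


# Hypothetical whitelist and CDX lines (SURT key, timestamp, JSON payload).
whitelist = [simple_surt("https://www.nasa.gov/"), simple_surt("epa.gov")]
cdx_lines = [
    'gov,nasa)/missions 20240722120000 {"url": "https://www.nasa.gov/missions"}',
    'com,example)/ 20240722120001 {"url": "https://example.com/"}',
]

# The SURT key is the first space-separated field of each CDX line.
kept = [line for line in cdx_lines
        if matches_whitelist(line.split(" ", 1)[0], whitelist)]
```

A linear scan over the prefixes is fine for illustration; with a whitelist of 1M SURTs a trie or sorted-prefix lookup would be the more realistic choice.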

Fetch WARC records based on the filtered CDX files (output is written to a local directory or S3):

# Input and output paths can be local or remote, e.g., S3.
cdxt -v --cc  warc_by_cdx \
    s3://commoncrawl-dev/eot-archive/malte-test/filtered-cdxs --index-glob "*.gz" \
    --prefix ./output/filtered-warcs/ \
    --warc-download-prefix=s3://commoncrawl \
   --creator foo --operator bob
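As a rough illustration of what fetching by CDX involves, the sketch below turns one CDX line into a download URL and an HTTP Range header. It assumes the Common Crawl index layout (SURT key, timestamp, then a JSON object with `filename`, `offset`, and `length`); the function name `cdx_line_to_range_request` and the sample WARC filename are hypothetical, and the PR's actual implementation may differ.

```python
import json


def cdx_line_to_range_request(line: str, download_prefix: str):
    """Map one CDX line to (url, Range header value) for its WARC record."""
    # A CDX line is "<surt> <timestamp> <json>"; the JSON may contain spaces.
    _surt_key, _timestamp, payload = line.split(" ", 2)
    fields = json.loads(payload)

    # offset and length are stored as strings in the index.
    offset = int(fields["offset"])
    length = int(fields["length"])

    url = download_prefix.rstrip("/") + "/" + fields["filename"]
    # HTTP byte ranges are inclusive on both ends.
    byte_range = f"bytes={offset}-{offset + length - 1}"
    return url, byte_range


# Hypothetical CDX line; the filename is a placeholder, not a real WARC path.
example_line = (
    'gov,nasa)/ 20240722120000 '
    '{"url": "https://www.nasa.gov/", "filename": "crawl-data/example.warc.gz", '
    '"offset": "1000", "length": "500"}'
)
url, byte_range = cdx_line_to_range_request(example_line, "s3://commoncrawl")
```

Each record is an independently gzipped member of the WARC file, which is why a single ranged read per CDX entry is enough to extract it.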

For better S3 read/write throughput, there is also a dedicated implementation based on aioboto3, which you can enable with the --implementation=aioboto3 argument:

# fetching takes ~10 hours for 3M records (filter from above) using a c5n.xlarge instance
cdxt -v --cc  warc_by_cdx \
    s3://commoncrawl-dev/eot-archive/malte-test/filtered-cdxs --index-glob "*.gz" \
    --prefix s3://commoncrawl-dev/eot-archive/malte-test/filtered-warcs/ \
    --warc-download-prefix=s3://commoncrawl \
   --creator foo --operator bob --implementation=aioboto3
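The throughput gain from an async implementation comes from keeping many requests in flight at once. The following is only a sketch of that pattern with plain asyncio and a stub fetcher; it does not use aioboto3 and is not taken from aioboto3_warcer.py. `fetch_record`, `fetch_all`, and `max_concurrency` are hypothetical names.

```python
import asyncio


async def fetch_record(job_id: int) -> bytes:
    """Stub standing in for one ranged S3 read (assumption: I/O bound)."""
    await asyncio.sleep(0.01)  # simulated network latency
    return f"record-{job_id}".encode()


async def fetch_all(job_ids, max_concurrency: int = 4):
    """Fetch all records with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(job_id: int) -> bytes:
        async with semaphore:
            return await fetch_record(job_id)

    # gather returns results in the same order as the submitted jobs.
    return await asyncio.gather(*(bounded(job_id) for job_id in job_ids))


results = asyncio.run(fetch_all(range(10)))
```

Bounding concurrency with a semaphore keeps memory use and connection counts predictable when the job list runs into millions of records.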

To preserve the filter inputs, e.g., the whitelist, one or more files can be added to the extracted WARC as resource records, optionally with metadata. To do this, pass the corresponding file paths via the arguments --write-paths-as-resource-records=s3://commoncrawl-dev/eot-archive/malte-test/ccf-gov-federal-web-graph-2024-jun-jul-aug.txt and --write-paths-as-resource-records-metadata=path/to/metadata.json. The metadata file is optional and can contain the following optional fields:

- warc_content_type: str
- uri: str
- http_headers: dict
- warc_headers_dict: str

This is one example of a metadata JSON file:

{
    "uri": "filter_cdx.cdx.gz",
    "warc_content_type": "application/cdx"
}
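One way to avoid hand-editing mistakes in the metadata file is to generate it with Python's json module. This is a sketch only: the field names come from the list above, but the helper `dump_metadata` and its unknown-field check are assumptions of this example, not behavior confirmed for the CLI.

```python
import json

# Optional fields listed in the PR description.
ALLOWED_FIELDS = {"warc_content_type", "uri", "http_headers", "warc_headers_dict"}


def dump_metadata(metadata: dict) -> str:
    """Serialize resource-record metadata to strict JSON.

    Assumption: rejecting unknown keys is this sketch's own safeguard;
    the CLI may simply ignore them.
    """
    unknown = set(metadata) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    return json.dumps(metadata, indent=4)


metadata_text = dump_metadata({
    "uri": "filter_cdx.cdx.gz",
    "warc_content_type": "application/cdx",
})

# Write the file that --write-paths-as-resource-records-metadata would point to.
# with open("metadata.json", "w", encoding="utf-8") as handle:
#     handle.write(metadata_text)
```

Generating the file this way guarantees strict JSON, which matters because Python's json parser rejects trailing commas.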

The full WARC extraction command would look like this:

cdxt -v --cc  warc_by_cdx \
    s3://commoncrawl-dev/eot-archive/malte-test/filtered-cdxs --index-glob "*.gz" \
    --prefix s3://commoncrawl-dev/eot-archive/malte-test/filtered-warcs/ \
    --warc-download-prefix=s3://commoncrawl \
   --creator foo --operator bob \
   --write-paths-as-resource-records=s3://commoncrawl-dev/eot-archive/malte-test/ccf-gov-federal-web-graph-2024-jun-jul-aug.txt \
   --write-paths-as-resource-records-metadata=path/to/metadata.json

TODOs

  • filter CDX: glob via fsspec to support remote input paths, like S3.
  • warc by CDX: support multiple input CDX files
  • unit tests
  • formatting + linting (wait until feat: Adding linting with flake8 #49 is merged)

@malteos changed the title from "feat: Adding filter_cdx and warc_by_cdx commands" to "feat: Adding filter_cdx and warc_by_cdx commands (2)" on Sep 19, 2025

codecov bot commented Sep 19, 2025

Codecov Report

❌ Patch coverage is 97.20000% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.46%. Comparing base (b1fea6c) to head (6441bf6).

Files with missing lines                        Patch %   Lines
cdx_toolkit/warcer_by_cdx/aioboto3_warcer.py    94.80%    8 Missing ⚠️
cdx_toolkit/warcer_by_cdx/__init__.py           92.30%    5 Missing ⚠️
cdx_toolkit/warcer_by_cdx/fsspec_warcer.py      95.83%    3 Missing ⚠️
cdx_toolkit/filter_cdx/__init__.py              98.21%    2 Missing ⚠️
cdx_toolkit/filter_cdx/matcher.py               96.36%    2 Missing ⚠️
cdx_toolkit/cli.py                              92.85%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #54      +/-   ##
==========================================
+ Coverage   95.89%   96.46%   +0.57%     
==========================================
  Files           7       19      +12     
  Lines         876     1583     +707     
==========================================
+ Hits          840     1527     +687     
- Misses         36       56      +20     
