
Parallel processing with jaq_json #323

@I-Al-Istannen


Hey,

I am currently using the Rust API to process slightly larger datasets (a few thousand JSON files, a few hundred MB in total) with many queries (100+). As this problem is embarrassingly parallel, I parallelized my program over the queries: worker threads each poll queries from a queue, run each query over all inputs, and collect the results. This leads to the following flow:

  1. Read all input files and queries
  2. Parse all input files to json (jaq_json's Val)
  3. Clone this into each worker thread, so they have the inputs lying around
  4. In each worker:
    1. Pick a query to run from the queue
    2. Clone your inputs, run the query against them, keep the result
    3. Loop
  5. Collect the results from all queries and return them
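
The flow above can be sketched with the standard library alone. This is a minimal illustration, not jaq's API: `Json` is a hypothetical stand-in for an `Arc`-based value type, `Query` is a hypothetical stand-in for a compiled filter, and `run_all`, `double` and `count` are names made up for this example. Because the stand-in is `Send + Sync`, the inputs are shared by reference instead of being cloned into each worker.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for a JSON value built on Arc (NOT jaq_json's Val).
#[derive(Clone, Debug, PartialEq)]
enum Json {
    Num(i64),
    Arr(Arc<Vec<Json>>),
}

// Hypothetical stand-in for a compiled query: a plain function pointer.
type Query = fn(&Json) -> Json;

// Runs every query over every input on `workers` threads.
// Result `i` holds the outputs of query `i`, one per input.
fn run_all(inputs: Arc<Vec<Json>>, queries: Vec<Query>, workers: usize) -> Vec<Vec<Json>> {
    let n = queries.len();
    // Work queue of (query index, query) pairs shared by all workers.
    let queue: Mutex<VecDeque<(usize, Query)>> =
        Mutex::new(queries.into_iter().enumerate().collect());
    let results: Mutex<Vec<Vec<Json>>> = Mutex::new(vec![Vec::new(); n]);

    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Step 4.1: pick a query; the lock is released at the end of this statement.
                let job = queue.lock().unwrap().pop_front();
                let Some((idx, query)) = job else { break };
                // Step 4.2: run the query against all inputs, keep the result.
                let out: Vec<Json> = inputs.iter().map(|v| query(v)).collect();
                results.lock().unwrap()[idx] = out;
            });
        }
    });
    // Step 5: all workers have finished once the scope ends.
    results.into_inner().unwrap()
}

// Two toy queries used only for demonstration.
fn double(v: &Json) -> Json {
    match v {
        Json::Num(n) => Json::Num(n * 2),
        Json::Arr(a) => Json::Arr(Arc::new(a.iter().map(double).collect())),
    }
}

fn count(v: &Json) -> Json {
    match v {
        Json::Num(_) => Json::Num(1),
        Json::Arr(a) => Json::Num(a.len() as i64),
    }
}

fn main() {
    let inputs = Arc::new(vec![
        Json::Num(1),
        Json::Arr(Arc::new(vec![Json::Num(2), Json::Num(3)])),
    ]);
    let results = run_all(inputs, vec![double as Query, count as Query], 2);
    println!("{results:?}");
}
```

With an `Rc`-based value type the same code is rejected by the compiler, since the worker closures would have to share or return non-`Send` data.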

This breaks completely at steps 3 and 5, as Val is not Send due to the Rc and cannot be passed to a different thread. Even if I move the JSON strings over and re-parse everything in each worker thread (which takes quite a while…), the problem persists: I cannot get the results back from the worker threads!
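
The underlying restriction can be reproduced without jaq. The sketch below uses plain `Vec<i64>` payloads and two helper names invented for this example (`assert_send`, `sum_on_thread`); it shows that an `Arc` can cross a thread boundary while the equivalent `Rc` line has to stay commented out.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

// Compiles only for types that may be moved to another thread.
fn assert_send<T: Send>() {}

// Moves an Arc into a new thread and returns the computed sum to the caller.
fn sum_on_thread(data: Arc<Vec<i64>>) -> i64 {
    thread::spawn(move || data.iter().sum::<i64>())
        .join()
        .unwrap()
}

fn main() {
    assert_send::<Arc<Vec<i64>>>();
    // assert_send::<Rc<Vec<i64>>>(); // rejected by the compiler: Rc is not Send

    // An Rc is fine as long as it never leaves the thread that created it.
    let local = Rc::new(vec![1, 2, 3]);
    println!("local sum = {}", local.iter().sum::<i64>());

    println!("threaded sum = {}", sum_on_thread(Arc::new(vec![1, 2, 3])));
}
```

Any type that contains an `Rc` anywhere inside it inherits the restriction, which is why neither the parsed inputs nor the query results can be handed between threads.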

My first solution was to define my own JSON type that I could convert to and from jaq_json's Val type, but the conversion and drop costs of that approach led to an order of magnitude slower execution.
My current solution basically copies jaq_json and replaces every Rc with Arc.

Could I talk you into defining a

#[cfg(feature = "sync-val")]
type Rc<T> = std::sync::Arc<T>;
#[cfg(not(feature = "sync-val"))]
type Rc<T> = std::rc::Rc<T>;

or something similar to let users opt in? As cargo features are additive, users might pay a small penalty if any crate in their dependency graph enables the feature for parallel processing, but that sounds acceptable to me.
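
To make the proposal concrete, here is a rough sketch of how such an alias could be used inside a value type. The `Val` enum below is a simplified, hypothetical stand-in and not jaq_json's actual definition, and `sync-val` is only the feature name suggested above. The rest of the code uses only the functions that `Rc` and `Arc` share (`new`, `clone`, `strong_count`), so it compiles with either alias.

```rust
// Pick the reference-counted pointer at compile time.
// "sync-val" is the hypothetical feature name proposed in this issue.
#[cfg(feature = "sync-val")]
pub type Rc<T> = std::sync::Arc<T>;
#[cfg(not(feature = "sync-val"))]
pub type Rc<T> = std::rc::Rc<T>;

// Simplified, hypothetical value type; the real Val has more variants.
#[derive(Clone, Debug, PartialEq)]
pub enum Val {
    Null,
    Int(i64),
    Str(Rc<String>),
    Arr(Rc<Vec<Val>>),
}

// Builds an array value; works unchanged for both pointer types.
pub fn array(items: Vec<Val>) -> Val {
    Val::Arr(Rc::new(items))
}

// Number of owners of an array's backing storage, or 0 for non-arrays.
pub fn owners(v: &Val) -> usize {
    match v {
        Val::Arr(a) => Rc::strong_count(a),
        _ => 0,
    }
}

fn main() {
    let a = array(vec![Val::Int(1), Val::Str(Rc::new("x".to_string())), Val::Null]);
    let b = a.clone(); // cheap: bumps the reference count, no deep copy
    println!("{:?} has {} owners", b, owners(&a));
}
```

With the feature enabled, `Val` would become `Send + Sync` automatically, at the cost of atomic reference counting for every user in the dependency graph.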

I can PR a feature like that if you want.
