Description
Hey,
I am currently using the Rust API to process somewhat larger datasets (a few thousand JSON files, a few hundred MB in total) with 100+ queries. As this problem is embarrassingly parallel, I parallelized my program over the queries: worker threads each poll queries from a queue, run the query over all inputs, and then collect the results. This leads to the following flow:
1. Read all input files and queries
2. Parse all input files to json (`jaq_json`'s `Val`)
3. Clone this into each worker thread, so they have the inputs lying around
4. In each worker:
   - Pick a query to run from the queue
   - Clone your inputs, run the query against it, keep the result
   - Loop
5. Collect the results from all queries and return it
This breaks completely at steps 3 and 5, as `Val` is not `Send` due to the `Rc` and cannot be passed to a different thread. Even if I move the JSON strings over and re-parse everything in each worker thread (which takes quite a while…), the problem persists: I cannot get the results back from the worker threads!
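To make the flow concrete, here is a minimal sketch of the worker-queue pattern using a toy value type. `ToyVal`, `total`, `run_query` and `run_all` are hypothetical names for illustration and are not part of `jaq_json`; the "query" is just a multiplication. The sketch compiles only because `ToyVal` uses `Arc`: swapping in `std::rc::Rc` makes `thread::spawn` reject the closure, because the value is no longer `Send`.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

/// Toy stand-in for a JSON value (NOT jaq_json's `Val`).
/// Backing the array with `Arc` is what makes this type `Send + Sync`.
#[derive(Clone, Debug, PartialEq)]
enum ToyVal {
    Num(i64),
    Arr(Arc<Vec<ToyVal>>),
}

/// Sum of all numbers contained in a value.
fn total(v: &ToyVal) -> i64 {
    match v {
        ToyVal::Num(n) => *n,
        ToyVal::Arr(items) => items.iter().map(total).sum(),
    }
}

/// Hypothetical "query": scale the total of the input by `factor`.
fn run_query(factor: i64, input: &ToyVal) -> ToyVal {
    ToyVal::Num(factor * total(input))
}

/// Run every query against every input on `workers` threads.
/// Returns one `(query, outputs)` pair per query, sorted by query.
fn run_all(inputs: Vec<ToyVal>, queries: Vec<i64>, workers: usize) -> Vec<(i64, Vec<ToyVal>)> {
    let inputs = Arc::new(inputs); // shared read-only inputs (step 3)
    let queue = Arc::new(Mutex::new(VecDeque::from(queries)));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let inputs = Arc::clone(&inputs);
            let queue = Arc::clone(&queue);
            thread::spawn(move || {
                let mut results = Vec::new();
                loop {
                    // The lock guard is dropped at the end of this statement.
                    let next = queue.lock().unwrap().pop_front();
                    let q = match next {
                        Some(q) => q,
                        None => break, // queue drained, worker is done
                    };
                    let out: Vec<ToyVal> = inputs.iter().map(|v| run_query(q, v)).collect();
                    results.push((q, out));
                }
                results // values cross the thread boundary here (step 5)
            })
        })
        .collect();

    let mut all: Vec<_> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    all.sort_by_key(|(q, _)| *q);
    all
}
```

Note that this sketch shares the inputs behind one `Arc` instead of cloning them per worker, which is only possible because the value type is `Sync` as well.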
My first solution was defining my own JSON type that I could convert to and from `jaq_json`'s `Val` type, but the conversion and dropping costs of that approach led to an order of magnitude slower execution.
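For illustration, a rough sketch of what such a conversion layer looks like, using toy types rather than the real `Val`. `RcVal`, `OwnedVal`, `to_owned_val`, `to_rc_val` and `round_trip_through_thread` are all made-up names. It shows where the cost comes from: every value crossing a thread boundary is deep-copied once on the way in and once on the way out, and both trees have to be dropped afterwards.

```rust
use std::rc::Rc;
use std::thread;

/// Toy `Rc`-based value, standing in for a non-`Send` JSON type.
#[derive(Clone, Debug, PartialEq)]
enum RcVal {
    Num(i64),
    Arr(Rc<Vec<RcVal>>),
}

/// Fully owned mirror type; it contains no `Rc`, so it is `Send`.
#[derive(Clone, Debug, PartialEq)]
enum OwnedVal {
    Num(i64),
    Arr(Vec<OwnedVal>),
}

/// Deep copy from the `Rc` tree into the owned tree: O(size of value).
fn to_owned_val(v: &RcVal) -> OwnedVal {
    match v {
        RcVal::Num(n) => OwnedVal::Num(*n),
        RcVal::Arr(items) => OwnedVal::Arr(items.iter().map(to_owned_val).collect()),
    }
}

/// Deep copy from the owned tree back into an `Rc` tree: O(size of value).
fn to_rc_val(v: &OwnedVal) -> RcVal {
    match v {
        OwnedVal::Num(n) => RcVal::Num(*n),
        OwnedVal::Arr(items) => RcVal::Arr(Rc::new(items.iter().map(to_rc_val).collect())),
    }
}

/// Send an input to a worker, "process" it there, and send the result back.
/// The processing step (wrapping the input twice in an array) is a placeholder
/// for running a real query.
fn round_trip_through_thread(input: OwnedVal) -> OwnedVal {
    thread::spawn(move || {
        let local = to_rc_val(&input); // deep copy on the way in
        let result = RcVal::Arr(Rc::new(vec![local.clone(), local]));
        to_owned_val(&result) // deep copy on the way out
    })
    .join()
    .unwrap()
}
```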
My current solution basically copies `jaq_json` and replaces every `Rc` with `Arc`.
Could I talk you into defining something like

```rust
#[cfg(feature = "sync-val")]
type Rc<T> = std::sync::Arc<T>;
#[cfg(not(feature = "sync-val"))]
type Rc<T> = std::rc::Rc<T>;
```

to allow users to opt into this? As cargo features are additive, users might pay a small penalty if any crate in their dependency graph enables the feature for parallel processing, but that sounds acceptable to me.
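As a sketch of how the rest of a crate could stay unchanged under such a switch, here is a toy value type written once against the alias. `ToyVal`, `arr` and `shares_storage` are hypothetical and only stand in for the real `Val`; the feature name `sync-val` is the one proposed above. Because `Rc` and `Arc` expose the same `new`, `clone` and `ptr_eq` API, the type compiles with either backing pointer, and the `Send + Sync` check is only compiled in when the feature is enabled.

```rust
// Select the reference-counted pointer at compile time.
#[cfg(feature = "sync-val")]
pub type Rc<T> = std::sync::Arc<T>;
#[cfg(not(feature = "sync-val"))]
pub type Rc<T> = std::rc::Rc<T>;

/// Toy value type written against the alias (NOT jaq_json's `Val`).
#[derive(Clone, Debug, PartialEq)]
pub enum ToyVal {
    Null,
    Str(Rc<String>),
    Arr(Rc<Vec<ToyVal>>),
}

impl ToyVal {
    /// Build an array value; works for both pointer types.
    pub fn arr(items: Vec<ToyVal>) -> Self {
        ToyVal::Arr(Rc::new(items))
    }
}

/// True if both values are arrays backed by the same allocation,
/// i.e. cloning only bumped a reference count.
pub fn shares_storage(a: &ToyVal, b: &ToyVal) -> bool {
    match (a, b) {
        (ToyVal::Arr(x), ToyVal::Arr(y)) => Rc::ptr_eq(x, y),
        _ => false,
    }
}

// Compile-time check, only active with the feature:
// the build fails if `ToyVal` is not `Send + Sync`.
#[cfg(feature = "sync-val")]
fn assert_send_sync<T: Send + Sync>() {}
#[cfg(feature = "sync-val")]
const _: fn() = assert_send_sync::<ToyVal>;
```

The trade-off is the one mentioned above: with the feature on, every clone and drop uses atomic reference counting, even in code that never leaves a single thread.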
I can PR a feature like that if you want.