
Conversation

NimaSarajpoor
Collaborator

NimaSarajpoor commented Oct 5, 2025

We may end up using different algorithms for computing the sliding dot product for arrays of different lengths. However, it would be great to see if one algorithm, like pyfftw_sdp, could outperform MATLAB's FFTW-based sliding dot product. It seems that pyfftw_sdp is slightly outperformed by MATLAB's fftw_sdp when the length of the array is < 2^8.

After checking the performance of (R)FFT and I(R)FFT in #19 and having some discussion in pyFFTW/pyFFTW#425 (comment), I decided to see if I can boost the performance of pyfftw_sdp.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 5, 2025

Tests are passing! Now, let's look at the performance. The data is saved in timing.csv.

# file: timing.sh

rm -rf sdp/__pycache__
./timing.py -niter 100 -pmin 2 -pmax 20 pyfftw challenger > timing.csv
rm -rf sdp/__pycache__
[Plot: timing results from timing.csv, comparing pyfftw and challenger]

We can see some improvement, particularly for cases where len(T) < 2^9.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 5, 2025

Now, let's check the performance on MATLAB Online. I've checked the following cases:

  • MATLAB fftw_sdp (Single Thread)
  • MATLAB fftw_sdp (Auto Threads)
  • MATLAB fftw_sdp (Max Threads, 512)
  • pyfftw_sdp (Single Thread)
  • challenger_sdp (Single Thread)

We will use the first case, i.e. MATLAB fftw_sdp (Single Thread), as the baseline, and compare the performance of the others against it.

The code for MATLAB's fftw_sdp is provided below:

% ref: DAMP supporting materials

function [Z] = SLIDING_DOT_PRODUCT_MASS_V2(Q, T)
    m = length(Q);
    n = length(T);

    % reverse the query and zero-pad it to length n
    Q = Q(end:-1:1);
    Q(m+1:n) = 0;

    % The main trick: getting all dot products
    % in O(n log n) time
    F_T = fft(T);
    F_Q = fft(Q);
    F_Z = F_T.*F_Q;
    Z = ifft(F_Z);
    Z = Z(m:n);

end
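
For reference, here is a minimal NumPy sketch of the same trick (using numpy.fft rather than pyFFTW, just to keep the example self-contained):

import numpy as np

def sliding_dot_product(Q, T):
    # Same O(n log n) trick as the MATLAB code above: reverse the query,
    # zero-pad it to len(T), multiply in the frequency domain, and keep
    # the valid part of the inverse transform.
    m = len(Q)
    n = len(T)
    Q_padded = np.zeros(n, dtype=np.float64)
    Q_padded[:m] = Q[::-1]
    Z = np.fft.irfft(np.fft.rfft(T) * np.fft.rfft(Q_padded), n=n)
    return Z[m - 1 : n]  # MATLAB's Z(m:n) in 0-based indexing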

Each of the images below corresponds to one value of len(T). The blue curve shows the performance of the existing pyfftw_sdp, and the red curve shows the performance of challenger_sdp.

[10 plots, one per len(T): blue = pyfftw_sdp, red = challenger_sdp, both relative to single-threaded MATLAB fftw_sdp]

@seanlaw
Contributor

seanlaw commented Oct 5, 2025

@NimaSarajpoor Can you please provide some conclusions/observations based on the outputs above?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 5, 2025

The following observations are based on the results I obtained for T with lengths in {2^2, 2^3, ..., 2^20}. The proposed (single-threaded) challenger_sdp outperforms MATLAB's FFT-based sdp (both single-threaded and multi-threaded). So, it should be safe to replace the existing pyfftw_sdp module with the one proposed in challenger_sdp.py. Then, when we implement an algorithm in Python (e.g. DAMP) and compare its performance with the MATLAB implementation, there is less chance of taking a performance hit because of the sdp part.

I think there are still two items that we need to check:

  • How does the proposed algo perform for longer arrays, where len(T) > 2^20?
  • If we pre-compute the FFT plans for arrays of different lengths, should we keep the objects or save the wisdom (and import it later)? In the latter case, we need to see how much overhead the "import" step adds to the running time.

@seanlaw
What do you think?

@seanlaw
Contributor

seanlaw commented Oct 6, 2025

How does the proposed algo perform for longer arrays, where len(T) > 2^20?

Yes, let's see the results for len(T) > 2^20.

If we pre-compute the FFT plans for arrays of different lengths, should we keep the objects or save the wisdom (and import it later)? In the latter case, we need to see how much overhead the "import" step adds to the running time.

Do we know if the saved wisdom is portable/transferable across different architectures/OS (i.e., is it simply a file that gets read)? I'd assume that the object is not portable/transferable. Given that we might use a Class, can we add the ability to cache the wisdom/object (in a Dict) after it has been computed the first time at different lengths? We can also add a class method to precompute the wisdom once (and possibly save it) AND add another method that allows the user to load wisdom into the class if they've precomputed it themselves.
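
Something along these lines, perhaps (a rough sketch; the class and method names are hypothetical, not an existing API):

import pickle

import numpy as np
import pyfftw


class SDPPlanCache:
    """Hypothetical sketch: cache per-length FFT plans and expose wisdom I/O."""

    def __init__(self):
        self._plans = {}  # maps n = len(T) to a (rfft_obj, irfft_obj) tuple

    def get(self, n):
        # Build (and cache) the forward/inverse plans the first time
        # a given length is requested
        if n not in self._plans:
            a = pyfftw.empty_aligned(n, dtype=np.float64)
            rfft_obj = pyfftw.builders.rfft(a, n=n, threads=1)
            b = pyfftw.empty_aligned(1 + n // 2, dtype=np.complex128)
            irfft_obj = pyfftw.builders.irfft(b, n=n, threads=1)
            self._plans[n] = (rfft_obj, irfft_obj)
        return self._plans[n]

    def precompute(self, p_min, p_max):
        # Warm the cache (and the global wisdom) for a range of 2^p lengths
        for p in range(p_min, p_max + 1):
            self.get(2 ** p)

    def save_wisdom(self, fname):
        # pyfftw's wisdom is global and accumulates as plans are created
        with open(fname, 'wb') as f:
            pickle.dump(pyfftw.export_wisdom(), f)

    def load_wisdom(self, fname):
        with open(fname, 'rb') as f:
            pyfftw.import_wisdom(pickle.load(f))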

So, creating the wisdom at runtime might be slow but probably still fine if the wisdom is reused many times. Alternatively, the user can choose to precompute the wisdom themselves BEFORE they call SDP for a range of len(T). Technically, this takes the same amount of time as computing it at runtime but, in both cases, we should give the user the ability to save the wisdom/object and load it back later.

My guess is that importing is negligible compared to the time it takes to compute the wisdom and/or to complete DAMP 🤷‍♂️

Does that help?

@NimaSarajpoor
Collaborator Author

How does the proposed algo perform for longer arrays, where len(T) > 2^20?

Yes, let's see the results for len(T) > 2^20.

As shown below, in most cases the speedup of challenger_sdp (relative to pyfftw_sdp) is >= 1. For longer arrays, the difference is negligible, and that's okay: the main goal here is to improve the performance for shorter arrays without losing performance for longer arrays. So, I think we are good here.

./timing.py -pmin 2 -pmax 25 pyfftw challenger > timing.csv

[Plot: speedup of challenger_sdp over pyfftw_sdp for len(T) from 2^2 to 2^25]

Do we know if the saved wisdom is portable/transferable across different architectures/OS (i.e., is it simply a file that gets read)?

The exported wisdom cannot be used on a different architecture. I checked this by exporting wisdom in Colab (Linux), saving it to a file, and then trying to use it on my macOS machine.

Export wisdom in Colab (Linux):

import pickle

import numpy as np
import pyfftw

input_data = pyfftw.empty_aligned(64, dtype=np.float64)
output_data = pyfftw.empty_aligned(1 + 64//2, dtype=np.complex128)

fft_object = pyfftw.FFTW(
    input_data,
    output_data,
    direction='FFTW_FORWARD',
)

wisdom = pyfftw.export_wisdom()
with open('./fft_wisdom.dat', 'wb') as f:
    pickle.dump(wisdom, f)

And then tried to use the wisdom on my macOS machine:

import pickle

import numpy as np
import pyfftw

with open('./fft_wisdom.dat', 'rb') as f:
    wisdom = pickle.load(f)
pyfftw.import_wisdom(wisdom)

input_data = pyfftw.empty_aligned(64, dtype=np.float64)
output_data = pyfftw.empty_aligned(1 + 64//2, dtype=np.complex128)
fft_object = pyfftw.FFTW(
    input_data,
    output_data,
    direction='FFTW_FORWARD',
    flags=('FFTW_WISDOM_ONLY',)  # raises an error if matching wisdom does not exist
)

Given that we might use a Class, can we add the ability to cache the wisdom/object (in a Dict) after it has been computed the first time at different lengths? We can also add a class method to precompute the wisdom once (and possibly save it) AND add another method that allows the user to load wisdom into the class if they've precomputed it themselves.

Based on the pyfftw-based sdp, we basically need to store two wisdoms/objects for each len(T) == 2^k. So, for 2^2 to 2^25, we need 24 keys, and each key can be mapped to a tuple like (RFFT wisdom, IRFFT wisdom).
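
For instance, something like this (a rough sketch; isolating per-length wisdom via forget_wisdom() is my assumption about how the entries would be captured):

import numpy as np
import pyfftw

wisdom_dict = {}
for p in range(2, 26):  # 24 keys: len(T) = 2^2, ..., 2^25
    n = 2 ** p

    pyfftw.forget_wisdom()  # isolate the wisdom for this length
    a = pyfftw.empty_aligned(n, dtype=np.float64)
    pyfftw.builders.rfft(a, n=n)  # plan the forward transform
    rfft_wisdom = pyfftw.export_wisdom()

    pyfftw.forget_wisdom()
    b = pyfftw.empty_aligned(1 + n // 2, dtype=np.complex128)
    pyfftw.builders.irfft(b, n=n)  # plan the inverse transform
    irfft_wisdom = pyfftw.export_wisdom()

    wisdom_dict[p] = (rfft_wisdom, irfft_wisdom)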

So, creating the wisdom at runtime might be slow but probably still fine if the wisdom is reused many times. Alternatively, the user can choose to precompute the wisdom themselves BEFORE they call SDP for a range of len(T). Technically, this takes the same amount of time as computing it at runtime but, in both cases, we should give the user the ability to save the wisdom/object and load it back later.

Makes sense.

My guess is that importing is negligible compared to the time it takes to compute the wisdom and/or to complete DAMP

TBD

@seanlaw
Contributor

seanlaw commented Oct 7, 2025

As shown below, in most cases the speedup of challenger_sdp (relative to pyfftw_sdp) is >= 1. For longer arrays, the difference is negligible, and that's okay: the main goal here is to improve the performance for shorter arrays without losing performance for longer arrays. So, I think we are good here.

Awesome! One thing that I worry about is the installation of FFTW. I know it's not REALLY our problem, but we should probably point people to the best way to get it installed since it doesn't come with pyfftw.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 19, 2025

WIP

I know it's not REALLY our problem but we should probably point people to the best way to get it installed since it doesn't come with pyfftw

[Note to Self] On MATLAB Online and Colab, FFTW is already available. This can be checked by running apt search libfftw3.

My guess is that importing is negligible compared to the time it takes to compute the wisdom and/or to complete DAMP

TBD

I have been thinking about what our baseline should be here. If I am interested in performing FFT on arrays of different lengths, then I am basically interested in the following baseline:

import time

import numpy as np
import pyfftw

p_min, p_max = 2, 15
p_range = range(p_min, p_max + 1)

timing = np.full(len(p_range), -1.0, dtype=np.float64)
for i, p in enumerate(p_range):
    T = np.random.rand(2 ** p).astype(np.float64)
    
    start = time.perf_counter()  # START
    n = len(T) 
    input_arr = pyfftw.empty_aligned(n, dtype=np.float64)
    rfft_obj = pyfftw.builders.rfft(
        input_arr,
        overwrite_input=True,
        avoid_copy=True,
        n=n,
        threads=1,
    )
    input_arr[:] = T
    rfft_obj.execute()
    stop = time.perf_counter()  # STOP
    
    timing[i] = stop - start

There is no optimization here... it simply creates the object and performs the transform. Now, I am trying to check two things:

(1) What does the performance look like if I export wisdom in advance and use it later?
So, I can simply add wisdom_dict[p] = pyfftw.export_wisdom() at the end of the script above to collect the wisdom for different lengths, and then save the dictionary via pickle.

with open('wisdom_dict.pkl', 'wb') as f:
    pickle.dump(wisdom_dict, f)

Then, read it in a new process. The new process looks like the baseline above, except that it loads the wisdom from the saved pickle file and imports it.

import pickle
import time

import numpy as np
import pyfftw

# load wisdom
with open('wisdom_files/wisdom_dict.pkl', 'rb') as f:
    wisdom_dict = pickle.load(f)

p_min, p_max = 2, 15
p_range = range(p_min, p_max + 1)

timing = np.full(len(p_range), -1.0, dtype=np.float64)
for i, p in enumerate(p_range):
    T = np.random.rand(2 ** p).astype(np.float64)
    
    start = time.perf_counter()  # START
    pyfftw.import_wisdom(wisdom_dict[p])  # IMPORT WISDOM
    n = len(T) 
    input_arr = pyfftw.empty_aligned(n, dtype=np.float64)
    rfft_obj = pyfftw.builders.rfft(
        input_arr,
        overwrite_input=True,
        avoid_copy=True,
        n=n,
        threads=1,
    )
    input_arr[:] = T
    rfft_obj.execute()
    stop = time.perf_counter()  # STOP
    
    timing[i] = stop - start

AFAIU, in contrast to pyfftw.FFTW, we cannot pass the flag FFTW_WISDOM_ONLY to pyfftw.builders.rfft to make sure it does not recompute the plan. But we can understand this by timing the script above and plotting its performance w.r.t. the baseline. It should also help us understand the impact of pyfftw.import_wisdom(...) on the running time.

[Plot: running time with imported wisdom, relative to the baseline]

The running time when importing wisdom is considerable for smaller p values. (Why? Shouldn't it improve the performance because some of the "planning" computation is bypassed?) The computation of twiddle factors still happens here because "WISDOM" is only about planning; it basically tells FFTW which algorithm it should use for an array of a particular size.
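
To sanity-check this, one could time plan creation with and without imported wisdom (a rough sketch; FFTW_MEASURE is used so that planning takes a measurable amount of time):

import time

import numpy as np
import pyfftw

n = 2 ** 15
a = pyfftw.empty_aligned(n, dtype=np.float64)
b = pyfftw.empty_aligned(1 + n // 2, dtype=np.complex128)

# Cold plan: no wisdom available
pyfftw.forget_wisdom()
start = time.perf_counter()
pyfftw.FFTW(a, b, flags=('FFTW_MEASURE',))
print('cold planning time:', time.perf_counter() - start)

# Warm plan: reuse the wisdom generated above
wisdom = pyfftw.export_wisdom()
pyfftw.forget_wisdom()
pyfftw.import_wisdom(wisdom)
start = time.perf_counter()
pyfftw.FFTW(a, b, flags=('FFTW_MEASURE',))
print('planning time with wisdom:', time.perf_counter() - start)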

(2) What if I store the rfft object itself? In this case, I create the objects first and store them in a dictionary. Then, I use those objects in the for-loop.

import time

import numpy as np
import pyfftw

p_min, p_max = 2, 15
p_range = range(p_min, p_max + 1)

# pre-compute the rfft objects (one per length)
rfft_dict = {}
for p in p_range:
    T = np.random.rand(2 ** p).astype(np.float64)

    n = len(T)
    input_arr = pyfftw.empty_aligned(n, dtype=np.float64)
    rfft_obj = pyfftw.builders.rfft(
        input_arr,
        overwrite_input=True,
        avoid_copy=True,
        n=n,
        threads=1,
    )
    input_arr[:] = T
    rfft_obj.execute()
    rfft_dict[p] = rfft_obj


timing = np.full(len(p_range), -1.0, dtype=np.float64)
for i, p in enumerate(p_range):
    T = np.random.rand(2 ** p).astype(np.float64)

    start = time.perf_counter()
    rfft_obj = rfft_dict[p]
    rfft_obj.input_array[:] = T  # fill the plan's own input buffer
    rfft_obj.execute()
    stop = time.perf_counter()

    timing[i] = stop - start

# `itr` is the index of an outer repetition loop (not shown)
np.save(f'numpy_files/PreComputeObj_timing_{itr}.npy', timing)

[Plot: running time with precomputed rfft objects, relative to the baseline]

@seanlaw
Contributor

seanlaw commented Oct 21, 2025

What if I store rfft object? In this case, I create the objects first, and store them in a dictionary. Then, I use those objects in the for-loop.

That speedup from storing the rfft object is pretty insane!
