
Conversation

NimaSarajpoor
Collaborator

NimaSarajpoor commented Oct 5, 2025

We may end up using different algorithms for computing the sliding dot product for arrays of different lengths. However, it would be great to see if one algorithm, like pyfftw_sdp, could outperform MATLAB's FFTW-based sliding dot product. It seems that pyfftw_sdp is slightly outperformed by MATLAB's fftw_sdp when the length of the array is < 2^8.

After checking the performance of (R)FFT and I(R)FFT in #19 and having some discussion in pyFFTW/pyFFTW#425 (comment), I decided to see if I can boost the performance of pyfftw_sdp.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 5, 2025

Tests are passing! Now, let's look at the performance. The data is saved in timing.csv.

# file: timing.sh

rm -rf sdp/__pycache__
./timing.py -niter 100 -pmin 2 -pmax 20 pyfftw challenger > timing.csv
rm -rf sdp/__pycache__
[Plot: timing results from timing.csv, comparing pyfftw and challenger]

We can see some improvement, particularly for cases where len(T) < 2^9.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 5, 2025

Now, let's check the performance on MATLAB Online. I've checked the following cases:

  • MATLAB fftw_sdp (Single Thread)
  • MATLAB fftw_sdp (Auto Threads)
  • MATLAB fftw_sdp (Max Threads, 512)
  • pyfftw_sdp (Single Thread)
  • challenger_sdp (Single Thread)

We will use the first case, i.e. MATLAB fftw_sdp (Single Thread), as the baseline, and compare the performance of the others against it.

The code for MATLAB's fftw_sdp is provided below:

% ref: DAMP supporting materials

function [Z] = SLIDING_DOT_PRODUCT_MASS_V2(Q, T)
    m = length(Q);
    n = length(T);

    % reverse the query and zero-pad it to length n
    Q = Q(end:-1:1);
    Q(m+1:n) = 0;

    % The main trick: getting all dot products
    % in O(n log n) time
    F_T = fft(T);
    F_Q = fft(Q);
    F_Z = F_T.*F_Q;
    Z = ifft(F_Z);
    Z = Z(m:n);

end
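
For reference, here is a minimal NumPy sketch of the same trick (using numpy.fft rather than pyFFTW, just to keep the example self-contained):

import numpy as np

def sliding_dot_product(Q, T):
    # Same O(n log n) trick as the MATLAB code above: reverse the query,
    # zero-pad it to len(T), multiply in the frequency domain, and keep
    # the valid part of the inverse transform.
    m = len(Q)
    n = len(T)
    Q_padded = np.zeros(n, dtype=np.float64)
    Q_padded[:m] = Q[::-1]
    Z = np.fft.irfft(np.fft.rfft(T) * np.fft.rfft(Q_padded), n=n)
    return Z[m - 1 : n]  # MATLAB's Z(m:n) in 0-based indexing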

Each of the images below corresponds to one value of len(T). The blue curve shows the performance of the existing pyfftw_sdp, and the red curve shows the performance of challenger_sdp.

[10 plots, one per len(T): blue = pyfftw_sdp, red = challenger_sdp, both relative to single-threaded MATLAB fftw_sdp]

@seanlaw
Contributor

seanlaw commented Oct 5, 2025

@NimaSarajpoor Can you please provide some conclusions/observations based on the outputs above?

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 5, 2025

The following observations are based on the results I obtained for T with lengths in {2^2, 2^3, ..., 2^20}. The proposed (single-threaded) challenger_sdp outperforms MATLAB's FFT-based sdp (both single-threaded and multi-threaded). So, it should be safe to replace the existing pyfftw_sdp module with the one proposed in challenger_sdp.py. Then, when we implement an algorithm in Python (e.g. DAMP) and compare its performance with the MATLAB implementation, there is less chance of taking a performance hit because of the sdp part.

I think there are still two items that we need to check:

  • How does the proposed algo perform for longer arrays, where len(T) > 2^20?
  • If we pre-compute the FFT plans for arrays of different lengths, should we keep the objects or save the wisdom (and import it later)? In the latter case, we need to see how much overhead the "import" step adds to the running time.

@seanlaw
What do you think?

@seanlaw
Contributor

seanlaw commented Oct 6, 2025

How does the proposed algo perform for longer arrays, where len(T) > 2^20?

Yes, let's see the results for len(T) > 2^20.

If we pre-compute the FFT plans for arrays of different lengths, should we keep the objects or save the wisdom (and import it later)? In the latter case, we need to see how much overhead the "import" step adds to the running time.

Do we know if the saved wisdom is portable/transferable across different architectures/OS (i.e., is it simply a file that gets read)? I'd assume that the object is not portable/transferable. Given that we might use a Class, can we add the ability to cache the wisdom/object (in a Dict) after it has been computed the first time at different lengths? We can also add a class method to precompute the wisdom once (and possibly save it) AND add another method that allows the user to load wisdom into the class if they've precomputed it themselves.
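
Something along these lines, perhaps (a rough sketch; the class and method names are hypothetical, not an existing API):

import pickle

import numpy as np
import pyfftw


class SDPPlanCache:
    """Hypothetical sketch: cache per-length FFT plans and expose wisdom I/O."""

    def __init__(self):
        self._plans = {}  # maps n = len(T) to a (rfft_obj, irfft_obj) tuple

    def get(self, n):
        # Build (and cache) the forward/inverse plans the first time
        # a given length is requested
        if n not in self._plans:
            a = pyfftw.empty_aligned(n, dtype=np.float64)
            rfft_obj = pyfftw.builders.rfft(a, n=n, threads=1)
            b = pyfftw.empty_aligned(1 + n // 2, dtype=np.complex128)
            irfft_obj = pyfftw.builders.irfft(b, n=n, threads=1)
            self._plans[n] = (rfft_obj, irfft_obj)
        return self._plans[n]

    def precompute(self, p_min, p_max):
        # Warm the cache (and the global wisdom) for a range of 2^p lengths
        for p in range(p_min, p_max + 1):
            self.get(2 ** p)

    def save_wisdom(self, fname):
        # pyfftw's wisdom is global and accumulates as plans are created
        with open(fname, 'wb') as f:
            pickle.dump(pyfftw.export_wisdom(), f)

    def load_wisdom(self, fname):
        with open(fname, 'rb') as f:
            pyfftw.import_wisdom(pickle.load(f))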

So, creating the wisdom at runtime might be slow but probably still fine if the wisdom is reused many times. Alternatively, the user can choose to precompute the wisdom themselves BEFORE they call SDP for a range of len(T). Technically, this takes the same amount of time as computing it at runtime but, in both cases, we should give the user the ability to save the wisdom/object and load it back later.

My guess is that importing is negligible compared to the time it takes to compute the wisdom and/or to complete DAMP 🤷‍♂️

Does that help?

@NimaSarajpoor
Collaborator Author

How does the proposed algo perform for longer arrays, where len(T) > 2^20?

Yes, let's see the results for len(T) > 2^20.

As shown below, in most cases the speedup of challenger_sdp (relative to pyfftw_sdp) is >= 1. For longer arrays, the difference is negligible, and that's okay: the main goal here is to improve the performance for shorter arrays without losing performance for longer arrays. So, I think we are good here.

./timing.py -pmin 2 -pmax 25 pyfftw challenger > timing.csv

[Plot: speedup of challenger_sdp over pyfftw_sdp for len(T) from 2^2 to 2^25]

Do we know if the saved wisdom is portable/transferable across different architectures/OS (i.e., is it simply a file that gets read)?

The exported wisdom cannot be used on a different architecture. I checked this by exporting wisdom in Colab (Linux), saving it to a file, and then trying to use it on my macOS machine.

Export wisdom in Colab (Linux):

import pickle

import numpy as np
import pyfftw

input_data = pyfftw.empty_aligned(64, dtype=np.float64)
output_data = pyfftw.empty_aligned(1 + 64//2, dtype=np.complex128)

fft_object = pyfftw.FFTW(
    input_data,
    output_data,
    direction='FFTW_FORWARD',
)

wisdom = pyfftw.export_wisdom()
with open('./fft_wisdom.dat', 'wb') as f:
    pickle.dump(wisdom, f)

And then tried to use the wisdom on my macOS machine:

import pickle

import numpy as np
import pyfftw

with open('./fft_wisdom.dat', 'rb') as f:
    wisdom = pickle.load(f)
pyfftw.import_wisdom(wisdom)

input_data = pyfftw.empty_aligned(64, dtype=np.float64)
output_data = pyfftw.empty_aligned(1 + 64//2, dtype=np.complex128)
fft_object = pyfftw.FFTW(
    input_data,
    output_data,
    direction='FFTW_FORWARD',
    flags=('FFTW_WISDOM_ONLY',)  # raises an error if matching wisdom does not exist
)

Given that we might use a Class, can we add the ability to cache the wisdom/object (in a Dict) after it has been computed the first time at different lengths? We can also add a class method to precompute the wisdom once (and possibly save it) AND add another method that allows the user to load wisdom into the class if they've precomputed it themselves.

Based on the pyfftw-based sdp, we basically need to store two wisdoms/objects for each len(T) == 2^k. So, for 2^2 to 2^25, we need 24 keys, and each key can be mapped to a tuple like (RFFT wisdom, IRFFT wisdom).
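
For instance, something like this (a rough sketch; isolating per-length wisdom via forget_wisdom() is my assumption about how the entries would be captured):

import numpy as np
import pyfftw

wisdom_dict = {}
for p in range(2, 26):  # 24 keys: len(T) = 2^2, ..., 2^25
    n = 2 ** p

    pyfftw.forget_wisdom()  # isolate the wisdom for this length
    a = pyfftw.empty_aligned(n, dtype=np.float64)
    pyfftw.builders.rfft(a, n=n)  # plan the forward transform
    rfft_wisdom = pyfftw.export_wisdom()

    pyfftw.forget_wisdom()
    b = pyfftw.empty_aligned(1 + n // 2, dtype=np.complex128)
    pyfftw.builders.irfft(b, n=n)  # plan the inverse transform
    irfft_wisdom = pyfftw.export_wisdom()

    wisdom_dict[p] = (rfft_wisdom, irfft_wisdom)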

So, creating the wisdom at runtime might be slow but probably still fine if the wisdom is reused many times. Alternatively, the user can choose to precompute the wisdom themselves BEFORE they call SDP for a range of len(T). Technically, this takes the same amount of time as computing it at runtime but, in both cases, we should give the user the ability to save the wisdom/object and load it back later.

Makes sense.

My guess is that importing is negligible compared to the time it takes to compute the wisdom and/or to complete DAMP

TBD

@seanlaw
Contributor

seanlaw commented Oct 7, 2025

As shown below, in most cases the speedup of challenger_sdp (relative to pyfftw_sdp) is >= 1. For longer arrays, the difference is negligible, and that's okay: the main goal here is to improve the performance for shorter arrays without losing performance for longer arrays. So, I think we are good here.

Awesome! One thing that I worry about is the installation of FFTW. I know it's not REALLY our problem, but we should probably point people to the best way to get it installed since it doesn't come with pyfftw.

@NimaSarajpoor
Collaborator Author

NimaSarajpoor commented Oct 19, 2025

WIP

I know it's not REALLY our problem but we should probably point people to the best way to get it installed since it doesn't come with pyfftw

[Note to Self] On MATLAB Online and Colab, FFTW is already available. This can be checked by running apt search libfftw3.

My guess is that importing is negligible compared to the time it takes to compute the wisdom and/or to complete DAMP

TBD

I have been thinking about what our baseline should be here. If I am interested in performing FFT on arrays of different lengths, then I am basically interested in the following baseline:

import time

import numpy as np
import pyfftw

p_min, p_max = 2, 15
p_range = range(p_min, p_max + 1)

timing = np.full(len(p_range), -1.0, dtype=np.float64)
for i, p in enumerate(p_range):
    T = np.random.rand(2 ** p).astype(np.float64)
    
    start = time.perf_counter()  # START
    n = len(T) 
    input_arr = pyfftw.empty_aligned(n, dtype=np.float64)
    rfft_obj = pyfftw.builders.rfft(
        input_arr,
        overwrite_input=True,
        avoid_copy=True,
        n=n,
        threads=1,
    )
    input_arr[:] = T
    rfft_obj.execute()
    stop = time.perf_counter()  # STOP
    
    timing[i] = stop - start

There is no optimization here... it simply creates the object and performs the transform. Now, I am trying to check two things:

(1) What does the performance look like if I export wisdom in advance and use it later?
So, I can simply add wisdom_dict[p] = pyfftw.export_wisdom() at the end of the script above to collect the wisdom for different lengths, and then save the dictionary via pickle.

with open('wisdom_dict.pkl', 'wb') as f:
    pickle.dump(wisdom_dict, f)

Then, read it in a new process. The new process looks like the baseline above, except that it loads the wisdom from the saved pickle file and imports it.

import pickle
import time

import numpy as np
import pyfftw

# load wisdom
with open('wisdom_files/wisdom_dict.pkl', 'rb') as f:
    wisdom_dict = pickle.load(f)

p_min, p_max = 2, 15
p_range = range(p_min, p_max + 1)

timing = np.full(len(p_range), -1.0, dtype=np.float64)
for i, p in enumerate(p_range):
    T = np.random.rand(2 ** p).astype(np.float64)
    
    start = time.perf_counter()  # START
    pyfftw.import_wisdom(wisdom_dict[p])  # IMPORT WISDOM
    n = len(T) 
    input_arr = pyfftw.empty_aligned(n, dtype=np.float64)
    rfft_obj = pyfftw.builders.rfft(
        input_arr,
        overwrite_input=True,
        avoid_copy=True,
        n=n,
        threads=1,
    )
    input_arr[:] = T
    rfft_obj.execute()
    stop = time.perf_counter()  # STOP
    
    timing[i] = stop - start

AFAIU, in contrast to pyfftw.FFTW, we cannot pass the flag FFTW_WISDOM_ONLY to pyfftw.builders.rfft to make sure it does not recompute the plan. But we can understand this by timing the script above and plotting its performance w.r.t. the baseline. It should also help us understand the impact of pyfftw.import_wisdom(...) on the running time.

[Plot: running time with imported wisdom, relative to the baseline]

The running time when importing wisdom is considerable for smaller p values. (Why? Shouldn't it improve the performance because some of the "planning" computation is bypassed?) The computation of twiddle factors still happens here because "WISDOM" is only about planning; it basically tells FFTW which algorithm it should use for an array of a particular size.
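
To sanity-check this, one could time plan creation with and without imported wisdom (a rough sketch; FFTW_MEASURE is used so that planning takes a measurable amount of time):

import time

import numpy as np
import pyfftw

n = 2 ** 15
a = pyfftw.empty_aligned(n, dtype=np.float64)
b = pyfftw.empty_aligned(1 + n // 2, dtype=np.complex128)

# Cold plan: no wisdom available
pyfftw.forget_wisdom()
start = time.perf_counter()
pyfftw.FFTW(a, b, flags=('FFTW_MEASURE',))
print('cold planning time:', time.perf_counter() - start)

# Warm plan: reuse the wisdom generated above
wisdom = pyfftw.export_wisdom()
pyfftw.forget_wisdom()
pyfftw.import_wisdom(wisdom)
start = time.perf_counter()
pyfftw.FFTW(a, b, flags=('FFTW_MEASURE',))
print('planning time with wisdom:', time.perf_counter() - start)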

(2) What if I store the rfft object itself? In this case, I create the objects first and store them in a dictionary. Then, I use those objects in the for-loop.

import time

import numpy as np
import pyfftw

p_min, p_max = 2, 15
p_range = range(p_min, p_max + 1)

# pre-compute the rfft objects (one per length)
rfft_dict = {}
for p in p_range:
    T = np.random.rand(2 ** p).astype(np.float64)

    n = len(T)
    input_arr = pyfftw.empty_aligned(n, dtype=np.float64)
    rfft_obj = pyfftw.builders.rfft(
        input_arr,
        overwrite_input=True,
        avoid_copy=True,
        n=n,
        threads=1,
    )
    input_arr[:] = T
    rfft_obj.execute()
    rfft_dict[p] = rfft_obj


timing = np.full(len(p_range), -1.0, dtype=np.float64)
for i, p in enumerate(p_range):
    T = np.random.rand(2 ** p).astype(np.float64)

    start = time.perf_counter()
    rfft_obj = rfft_dict[p]
    rfft_obj.input_array[:] = T  # fill the plan's own input buffer
    rfft_obj.execute()
    stop = time.perf_counter()

    timing[i] = stop - start

# `itr` is the index of an outer repetition loop (not shown)
np.save(f'numpy_files/PreComputeObj_timing_{itr}.npy', timing)

[Plot: running time with precomputed rfft objects, relative to the baseline]

@seanlaw
Contributor

seanlaw commented Oct 21, 2025

What if I store rfft object? In this case, I create the objects first, and store them in a dictionary. Then, I use those objects in the for-loop.

That speedup from storing the rfft object is pretty insane!
