Spec
All tests were run at 720p on my seven-year-old PC (E3-1231 + GTX 960, Windows 10 21H1).

Shaders
For realtime blur, you can use the Gaussian blur shaders (DX11 only):
https://github.com/defisym/OpenFusio...s/Gauss%20Blur

GaussBlur 2D: traditional 2D Gaussian blur. Very slow.
GaussBlur: does a 1D blur along X, then along Y. Because Fusion currently cannot attach multiple shaders to one object, the shader has to recalculate the X blur for every pixel within the radius. About 10X faster than GaussBlur 2D.
GaussBlur 1D: does only a 1D blur of the backdrop. You need two objects to blur the backdrop: set the first to X blur and the second to Y blur. About 20X faster than GaussBlur 2D.
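To show why splitting the blur into an X pass and a Y pass helps, here is a minimal CPU sketch in C++ (not the HLSL from the shaders above). A full 2D kernel reads (2R+1)^2 pixels per output pixel, while two 1D passes read 2*(2R+1). The Image struct, the function names and the choice sigma = R/3 are all assumptions made for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Normalised 1D Gaussian weights for offsets -radius..radius.
// sigma = radius / 3 is an assumption of this sketch, not taken from the shaders.
std::vector<double> gaussKernel(int radius) {
    assert(radius >= 0);
    const double sigma = radius > 0 ? radius / 3.0 : 1.0;
    std::vector<double> k(2 * radius + 1);
    double total = 0.0;
    for (int i = -radius; i <= radius; ++i) {
        k[i + radius] = std::exp(-(i * i) / (2.0 * sigma * sigma));
        total += k[i + radius];
    }
    for (double& w : k) w /= total;  // weights now sum to 1
    return k;
}

// Hypothetical greyscale image, row-major.
struct Image {
    int w, h;
    std::vector<double> px;
};

// One 1D pass: (dx, dy) = (1, 0) blurs along X, (0, 1) blurs along Y.
// Samples outside the image are clamped to the nearest edge pixel.
Image blur1D(const Image& src, const std::vector<double>& k, int dx, int dy) {
    const int radius = static_cast<int>(k.size()) / 2;
    Image dst{src.w, src.h, std::vector<double>(src.px.size())};
    for (int y = 0; y < src.h; ++y) {
        for (int x = 0; x < src.w; ++x) {
            double acc = 0.0;
            for (int i = -radius; i <= radius; ++i) {
                const int sx = std::min(std::max(x + i * dx, 0), src.w - 1);
                const int sy = std::min(std::max(y + i * dy, 0), src.h - 1);
                acc += k[i + radius] * src.px[sy * src.w + sx];
            }
            dst.px[y * src.w + x] = acc;
        }
    }
    return dst;
}

// X pass followed by Y pass, like stacking two GaussBlur 1D objects.
Image blurSeparable(const Image& src, int radius) {
    const std::vector<double> k = gaussKernel(radius);
    return blur1D(blur1D(src, k, 1, 0), k, 0, 1);
}
```

With R=30 that is 3721 reads per pixel for the 2D kernel against 122 for the two passes, which is roughly why the two-object setup is so much cheaper.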

Shader Name     Radius    Frame Rate (FPS)
GaussBlur 2D    30        2
GaussBlur       30        23
GaussBlur 1D    30        60

But if you set the radius greater than 40, even GaussBlur 1D will drop to 40 FPS on my PC. One solution is to chain several blurs with a small radius. Another solution, if you don't need realtime blur (e.g. for a pause menu backdrop), is to use the extension:

Extension
Source code: https://github.com/defisym/OpenFusio...ensions/WinAPI
Release: https://github.com/defisym/OpenFusio...WinAPI_B210905

I implemented three algorithms (in theory you can easily port them to Android, etc., as I only used the STL and cSurface), and the fastest one is stack blur. To use it, first check the display checkbox in the object properties, then load an image into it, e.g. capture the frame area. The last step is to use the stack blur action.
The blur action has three params. The first is the radius. The running time of stack blur is almost independent of the radius: in theory a bigger R costs more time, but in practice the difference is only about 1 ms (R=10 vs R=250). The second is the downscaling scale, which you can just keep at 1.0. It did have some effect with the default SDK settings (about 20% less time if set to 2.0), but after enabling some optimization options like /Ob, /Oi, etc., the stack blur itself is fast enough (2X faster than with the default settings) that downscaling and then resizing back for display costs more time than it saves. The last one is threads; keep it at -1 to use the maximum number of threads of your PC. My PC is 4C8T: a single thread takes 37 ms, while 8 threads take 17 ms (but 4 threads also take 17 ms, so more threads is not always better).
The algorithm itself can be optimised further (e.g. GPU acceleration), but that's currently beyond my ability as a newbie.
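To illustrate why the cost of stack blur barely depends on the radius, here is a minimal 1D sketch in C++. It is not the extension's code (which works on cSurface and runs on several threads); the function name and the use of a plain row of doubles are assumptions for this example. Each output value is a triangle-weighted average, but after the first pixel the window is moved with a fixed number of additions and subtractions, whatever R is.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Illustrative 1D stack blur over one row of samples (hypothetical helper).
// Weights form a triangle: radius+1 at the centre down to 1 at distance radius.
// Samples outside the row are clamped to the nearest edge value.
std::vector<double> stackBlur1D(const std::vector<double>& src, int radius) {
    assert(radius >= 0);
    const int n = static_cast<int>(src.size());
    if (n == 0 || radius == 0) return src;

    auto at = [&](int i) { return src[std::min(std::max(i, 0), n - 1)]; };
    const double div = static_cast<double>(radius + 1) * (radius + 1);  // sum of weights

    // Build the weighted sum for pixel 0 once; this is the only O(radius) step.
    double sum = 0.0;     // triangle-weighted sum of the current window
    double sumOut = 0.0;  // plain sum of the left half, centre included
    double sumIn = 0.0;   // plain sum of the right half
    for (int k = -radius; k <= radius; ++k) {
        const double v = at(k);
        sum += v * (radius + 1 - std::abs(k));
        if (k > 0) sumIn += v; else sumOut += v;
    }

    std::vector<double> dst(n);
    for (int i = 0; i < n; ++i) {
        dst[i] = sum / div;

        // Slide the window one pixel to the right in constant time:
        // every left-half weight drops by 1, every right-half weight rises by 1.
        sum -= sumOut;
        sumOut -= at(i - radius);       // leftmost sample leaves the window
        sumIn += at(i + radius + 1);    // new sample enters on the right
        sum += sumIn;
        const double centre = at(i + 1);  // becomes the new centre
        sumOut += centre;
        sumIn -= centre;
    }
    return dst;
}
```

For an image you would run this over every row and then every column. Rows (or columns) are independent of each other, so they can be split across threads.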

For more information about the performance benchmarks, please check this link:
https://github.com/defisym/OpenFusio...WinAPI/BlurCMP

Shader Vs Extension
Radius is set to 50. Please check the video below: