Stopping the flicker when you restyle a video frame by frame

Run a diffusion restyle on every frame of a clip, one frame at a time, and the still images look great. Then you play them back and the whole thing boils. Textures crawl, colors pulse, a brick wall shifts its grout lines every frame. The model did nothing wrong on any single frame. It just made a slightly different choice each time, and at 24 frames a second your eye reads those differences as flicker.

This is a walkthrough of the code that kills that flicker. The trick is to stop restyling every frame. Stylize a few frames, then carry the style to the rest by following the motion. The interesting part is the bookkeeping and the blending that make the carry invisible, so I will spend most of the post there.

Each frame is a separate sample from the model, so each frame lands in a slightly different place. Played in sequence, those small differences become flicker. Photo: Unsplash.

A diffusion model starts from noise and walks toward an image. Two frames of a video that look almost identical to you are still two different starting points and two different walks. The model has no memory of what it drew last frame, so it picks a fresh interpretation of "oil painting" or "anime" each time. On a still you never notice. In motion you see the model changing its mind 24 times a second.

You can lower the denoise strength so the model stays close to the input, but then you barely restyle anything. You can feed the previous frame back in, which helps a little and drifts a lot. The cleaner answer is structural: restyle only a sparse set of frames, and fill the gaps by warping a real stylized frame into place. A warped frame cannot disagree with itself between neighbors, because it is the same pixels pushed along the motion. This is the idea behind Rerender A Video (Yang et al., SIGGRAPH Asia 2023) and behind EbSynth (Jamriška et al., ACM ToG 2019), and it is what the code below implements.

Keyframes come from scene detection plus a fixed interval. Inside the scene detector:

if interval:
    scene_frames.extend(range(start_frame, end_frame, interval))
    scene_frames.append(end_frame - 1)

The default interval is 10. So inside each detected scene you take every tenth frame, plus the last frame of the scene, as keyframes. Those are the only frames the diffusion model ever touches. Everything between two keyframes is going to be synthesized by warping, not by the model. Pick the interval too large and motion outruns the warp; pick it too small and you pay for diffusion you did not need. Ten is a reasonable middle for most footage.
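Here is a minimal sketch of that selection rule as a standalone function. The name select_keyframes and the (start, end) scene tuples are my own stand-ins, not the repo's API, and I assume end is exclusive, as the range() call in the excerpt suggests.

```python
def select_keyframes(scenes, interval=10):
    """Return sorted keyframe indices for a list of (start, end) scenes.

    Hypothetical helper: `end` is treated as exclusive, like range().
    """
    keyframes = set()  # a set drops the duplicate when end - 1 is already on the grid
    for start_frame, end_frame in scenes:
        keyframes.update(range(start_frame, end_frame, interval))
        keyframes.add(end_frame - 1)  # always keep the last frame of the scene
    return sorted(keyframes)


print(select_keyframes([(0, 25), (25, 46)]))
# [0, 10, 20, 24, 25, 35, 45]
```

The second scene shows why the last-frame rule matters less often than you might expect: frame 45 is already on the interval grid, so nothing extra is added there.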

Once you commit to "stylize keyframes, propagate the rest," you inherit a filing problem. For every gap between keyframe i and keyframe i+1 you need the right input frames, the right output paths, the right optical-flow files, and the right guide images, in both the forward and backward direction. Get one index off and a frame lands in the wrong folder. VideoSequence in video_sequence.py is the class that does this filing.

It is constructed with the list of keyframes (here called frame_files_with_interval) and it makes one output subdirectory per keyframe:

self.__frame_files_with_interval = [f for f in frame_files_with_interval if ".png" in f]
self.__n_seq = len(self.__frame_files_with_interval)
out_subdir = self.__get_out_subdir(frame_file)   # out_<keyframe-name>

The core method is get_input_sequence. Given a gap index i, it returns the list of input frame paths in that gap:

def get_input_sequence(self, i, is_forward=True):
    if i + 1 > len(self.__frame_files_with_interval) - 1:
        last_input_frame = self.__input_frames[-1]
        last_interval_frame = self.__frame_files_with_interval[i]
        if last_input_frame == last_interval_frame:
            return None
        else:
            beg_id = int(last_interval_frame.split(".")[0])
            end_id = int(last_input_frame.split(".")[0])
    else:
        beg_id = self.get_sequence_beg_id(i)
        end_id = self.get_sequence_beg_id(i + 1)
    if is_forward:
        id_list = list(range(beg_id, end_id))
    else:
        id_list = list(range(end_id, beg_id, -1))
    return [os.path.join(self.__input_dir, self.__input_format % id)
            for id in id_list if self.__input_format % id in self.__input_frames]

Two things to notice. First, beg_id and end_id come straight from the keyframe filenames, which are named by frame number (%04d.jpg). The filename is the index. That is why the whole class can do its math on int(name.split(".")[0]) instead of carrying a separate table. Second, the is_forward flag reverses the range. The pipeline propagates style from the left keyframe rightward, and from the right keyframe leftward, then meets in the middle. The backward pass walks the same gap in reverse, and this one flag gives both. The two lists are not exact mirrors: forward runs from beg_id up to end_id - 1, backward from end_id down to beg_id + 1, so each pass starts on its own keyframe and stops one frame short of the other.
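To make the index arithmetic concrete, here is a small sketch that reproduces just the two range() calls from get_input_sequence. The helper names are hypothetical, and the filenames are made-up examples.

```python
def id_from_name(name):
    """Frame number encoded in a filename such as '0011.png'."""
    return int(name.split(".")[0])


def gap_frame_ids(beg_id, end_id, is_forward=True):
    """Frame ids visited inside one gap, mirroring the ranges in the excerpt."""
    if is_forward:
        return list(range(beg_id, end_id))       # beg_id ... end_id - 1
    return list(range(end_id, beg_id, -1))       # end_id ... beg_id + 1


beg = id_from_name("0011.png")
end = id_from_name("0021.png")
forward = gap_frame_ids(beg, end)
backward = gap_frame_ids(beg, end, is_forward=False)

print(forward[0], forward[-1])    # 11 20
print(backward[0], backward[-1])  # 21 12
```

Each direction starts on its own keyframe and never reaches the opposite one, which is what you want when both ends are already stylized.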

The same shape repeats for every artifact the propagation needs:

get_output_sequence builds the destination paths inside the keyframe's out_ folder.

get_flow_sequence builds flow_f_%04d.npy for forward and flow_b_%04d.npy for backward.

get_edge_sequence, get_temporal_sequence, and get_pos_sequence build the per-frame guide paths in the keyframe's tmp folder.

One detail worth flagging: the flow lists are one element shorter than the frame lists. There are N frames but only N-1 motions between them:

if is_forward:
    id_list = list(range(beg_id, end_id - 1))      # forward flows: N-1
else:
    id_list = list(range(end_id, beg_id + 1, -1))  # backward flows: N-1

If you ever zip flows against frames and get an off-by-one, this is where it comes from. The class is built so the flow list and the warp loop line up.
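A quick sketch of how the two lists line up in a warp loop. The function names are mine, and which frame pair a given flow id describes is an assumption here; the only thing taken from the excerpts is the range arithmetic.

```python
def frame_ids(beg_id, end_id, is_forward=True):
    """Frame ids in a gap (same ranges as get_input_sequence)."""
    if is_forward:
        return list(range(beg_id, end_id))
    return list(range(end_id, beg_id, -1))


def flow_ids(beg_id, end_id, is_forward=True):
    """Flow ids in a gap: one fewer than the frames."""
    if is_forward:
        return list(range(beg_id, end_id - 1))
    return list(range(end_id, beg_id + 1, -1))


def warp_steps(beg_id, end_id, is_forward=True):
    """Pair each (source frame, target frame) hop with one flow id."""
    frames = frame_ids(beg_id, end_id, is_forward)
    flows = flow_ids(beg_id, end_id, is_forward)
    assert len(flows) == len(frames) - 1, "flows must be one shorter than frames"
    return list(zip(frames[:-1], frames[1:], flows))


steps = warp_steps(11, 21)
print(steps[0], steps[-1])  # (11, 12, 11) (19, 20, 19)
```

Zipping frames[:-1] against frames[1:] gives N-1 hops, so the flow list fits without padding or slicing.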

The last gap is the awkward one. The final keyframe is rarely the literal last frame of the video, so the code special-cases it: when i+1 runs past the keyframe list, it uses self.__input_frames[-1], the true last frame, as the end of the gap. Without that branch the tail of every clip would go unstyled.
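The tail branch is easy to get wrong, so here is a sketch of it in isolation. gap_bounds is a hypothetical function, not a VideoSequence method, and it compares frame numbers rather than raw filenames, which is my simplification.

```python
def gap_bounds(keyframe_names, input_names, i):
    """Return (beg_id, end_id) for gap i, or None when there is no tail to fill.

    Hypothetical helper: both lists hold filenames like '0011.png', sorted
    by frame number.
    """
    key_ids = [int(name.split(".")[0]) for name in keyframe_names]
    if i + 1 > len(key_ids) - 1:
        # Gap after the final keyframe: run to the true last input frame.
        last_input_id = int(input_names[-1].split(".")[0])
        if last_input_id == key_ids[i]:
            return None  # the final keyframe is the last frame, nothing to do
        return key_ids[i], last_input_id
    return key_ids[i], key_ids[i + 1]


keys = ["0001.png", "0011.png"]
frames = ["%04d.jpg" % n for n in range(1, 16)]  # frames 1..15

print(gap_bounds(keys, frames, 0))  # (1, 11)
print(gap_bounds(keys, frames, 1))  # (11, 15)
```

If the clip ends exactly on a keyframe, the function returns None and the caller can skip the gap.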

Propagation here is done EbSynth-style: you give EbSynth a source stylized image and a set of guide channels, and it synthesizes the target frame so it matches the style of the source while respecting the guides. The guides live in guide.py. Each one answers a different question for the synthesizer.

The positional guide answers "where did this pixel come from?" It starts from a synthetic image where each pixel encodes its own coordinate as color, then warps that image along the optical flow, frame after frame:

@staticmethod
def __generate_first_img(H, W):
    Hs = np.linspace(0, 1, H)
    Ws = np.linspace(0, 1, W)
    i, j = np.meshgrid(Hs, Ws, indexing='ij')
    r = (i * 255).astype(np.uint8)   # row -> red
    g = (j * 255).astype(np.uint8)   # col -> green
    b = np.zeros(r.shape)
    return np.stack((b, g, r), 2)

Red is the row, green is the column. After you warp this map by the flow, a pixel's color tells you which original pixel ended up there. That is a dense, smooth correspondence field, which is exactly what a synthesizer wants so it does not invent new texture in moving regions.
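Here is a NumPy-only sketch of reading a warped coordinate map. It keeps the coordinates as floats instead of 8-bit colors, uses a nearest-neighbour lookup warp of my own rather than the repo's flow_calc.warp, and assumes flow[..., 0] is dx and flow[..., 1] is dy. All names are hypothetical.

```python
import numpy as np


def coordinate_map(h, w):
    """Each pixel stores its own (row, col), normalised to [0, 1]."""
    rows, cols = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    return np.stack((rows, cols), axis=2)


def warp_lookup(img, flow):
    """Nearest-neighbour warp: out[y, x] = img[y + dy, x + dx], clamped at borders."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    return img[src_y, src_x]


def decode(pos):
    """Turn a (possibly warped) coordinate map back into integer source pixels."""
    h, w = pos.shape[:2]
    rows = np.rint(pos[..., 0] * (h - 1)).astype(int)
    cols = np.rint(pos[..., 1] * (w - 1)).astype(int)
    return rows, cols


pos0 = coordinate_map(6, 8)
flow = np.zeros((6, 8, 2))
flow[..., 0] = -2                      # content slides two pixels to the right
pos1 = warp_lookup(pos0, flow)
rows, cols = decode(pos1)
print(rows[3, 5], cols[3, 5])          # 3 3
```

The pixel now at column 5 reports that it came from column 3. The two leftmost columns have no real source and are clamped to column 0 here; in the real pipeline those are the holes the mask and inpainting handle.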

The temporal guide answers "what did the previous stylized frame look like, moved to here?" It takes the previous stylized frame and warps it forward by the flow:

def get_cmd(self, i, weight) -> str:
    if i == 0:
        warped_img = self.stylized_imgs[0]
    else:
        prev_img = cv2.imread(self.stylized_imgs[i - 1])
        warped_img = self.flow_calc.warp(prev_img, self.flows[i - 1], 'nearest').astype(np.uint8)
        warped_img = cv2.inpaint(warped_img, self.masks[i - 1], 30, cv2.INPAINT_TELEA)
        cv2.imwrite(self.imgs[i], warped_img)
    return super().get_cmd(i, weight)

This is the anti-flicker guide. It pushes the synthesizer to make frame i look like frame i-1 carried along the motion, so the style stays put on a surface as it moves instead of re-rolling every frame.

Both warps leave holes. Where motion uncovers a region the camera could not see last frame, the warp has no data, and the optical-flow mask marks those pixels. The fix is the same in both guides:

cur_img = cv2.inpaint(cur_img, mask, 30, cv2.INPAINT_TELEA)

cv2.INPAINT_TELEA fills the disoccluded holes from their surroundings so the guide has no black gaps. A radius of 30 pixels is generous, which suits the smooth guide maps; you do not need sharp inpainting here, just plausible filler.

The other two guides are simpler. The edge guide runs a Laplacian-style filter so the synthesizer keeps structure aligned to the input:

# 4-neighbour Laplacian kernel; named `kernel` here to avoid shadowing the builtin filter()
kernel = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]])
res = cv2.filter2D(img, -1, kernel)

And the color guide is just the raw frames, so the synthesizer has the original colors to refer to. Each guide carries a -weight, so you can dial how strongly the synthesizer listens to motion versus structure versus color.

Now you have two stylized versions of every in-between frame: one propagated forward from the left keyframe, one propagated backward from the right keyframe. They agree on geometry but rarely on color, because each picked up the tint of a different keyframe along the way. Stack them naively and you get a visible color step. histogram_blend.py fixes the color before the seam gets stitched.

It works in Lab color space, which separates lightness from color so you can match tone without muddying brightness:

a = cv2.cvtColor(a, cv2.COLOR_BGR2Lab)
b = cv2.cvtColor(b, cv2.COLOR_BGR2Lab)
t_mean_val = 0.5 * 256
t_std_val = (1 / 36) * 256
a = histogram_transform(a, a_mean, a_std, t_mean, t_std)
b = histogram_transform(b, b_mean, b_std, t_mean, t_std)
ab = (a * weight1 + b * weight2 - t_mean_val) / 0.5 + t_mean_val
ab = histogram_transform(ab, ab_mean, ab_std, min_error_mean, min_error_std)

The shape is: push both images to the same neutral mean and standard deviation, take their weighted average (the division by 0.5 doubles the average's deviation from mid-gray), then push the result to the statistics of min_error, the frame the pipeline trusts most for this position. The two odd constants are a target mean of 0.5 * 256 (mid-gray) and a target std of (1/36) * 256. They are just a stable common ground to average in; the final re-keying is what makes the result match a real frame rather than a washed-out midpoint. This is per-channel mean/std transfer, the same idea as classic Reinhard color transfer, done twice.
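A minimal sketch of the double transfer, assuming the inputs are already float arrays in Lab (I skip the cv2 conversion). The function names are mine, not the ones in histogram_blend.py. I also leave out the division by 0.5: it is an affine rescale, and in this sketch the final transfer normalises by the mixed image's own statistics, which absorbs it.

```python
import numpy as np


def mean_std_transfer(img, target_mean, target_std):
    """Shift and scale each channel so its mean and std match the targets."""
    x = img.astype(np.float64)
    mean = x.mean(axis=(0, 1))
    std = np.maximum(x.std(axis=(0, 1)), 1e-6)   # guard against flat channels
    return (x - mean) / std * target_std + target_mean


def blend_two_passes(a, b, ref, w1=0.5, w2=0.5):
    """Neutralise both passes, average them, then re-key to the reference stats."""
    neutral_mean = 0.5 * 256
    neutral_std = (1 / 36) * 256
    a_n = mean_std_transfer(a, neutral_mean, neutral_std)
    b_n = mean_std_transfer(b, neutral_mean, neutral_std)
    mixed = a_n * w1 + b_n * w2
    ref = ref.astype(np.float64)
    return mean_std_transfer(mixed, ref.mean(axis=(0, 1)), ref.std(axis=(0, 1)))


rng = np.random.default_rng(0)
forward_pass = rng.uniform(0, 255, (8, 8, 3))
backward_pass = forward_pass * 0.8 + 30          # same content, different tint
reference = rng.uniform(50, 200, (8, 8, 3))
out = blend_two_passes(forward_pass, backward_pass, reference)
```

Whatever tint each pass picked up, the output ends with the reference frame's per-channel mean and standard deviation, which is the property the seam stitching relies on.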

Color matching makes the two halves agree on average. It does not hide the actual boundary where forward meets backward. For that the pipeline pastes in the gradient domain, the same Poisson Image Editing idea from Pérez, Gangnet and Blake (SIGGRAPH 2003) that photo tools use to drop an object into a new background without a halo.

The principle: do not copy pixels, copy differences between pixels. Build a target gradient field by taking gradients from image 1 outside the mask and image 2 inside it, then solve for the image whose gradients match that field. Seams disappear because you never enforce an absolute pixel value at the boundary, only the slope across it.

def poisson_fusion(blendI, I1, I2, mask, grad_weight=[2.5, 0.5, 0.5]):
    Iab = cv2.cvtColor(blendI, cv2.COLOR_BGR2LAB).astype(float)
    Ia  = cv2.cvtColor(I1, cv2.COLOR_BGR2LAB).astype(float)
    Ib  = cv2.cvtColor(I2, cv2.COLOR_BGR2LAB).astype(float)
    m = (mask > 0).astype(float)[:, :, np.newaxis]

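    # Target gradient field: I1's gradients outside the mask, I2's inside it.
    gx = np.zeros_like(Ia)
    gy = np.zeros_like(Ia)
    gx[:-1] = (Ia[:-1] - Ia[1:]) * (1 - m[:-1]) + (Ib[:-1] - Ib[1:]) * m[:-1]
    gy[:, :-1] = (Ia[:, :-1] - Ia[:, 1:]) * (1 - m[:, :-1]) + (Ib[:, :-1] - Ib[:, 1:]) * m[:, :-1]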

Then for each channel it solves a least-squares system Ax = b where A stacks the gradient operators and an identity term, and b stacks the target gradients and the original intensities:

A = As[i]
b = np.vstack([im_dx * weight, im_dy * weight, im])
out = scipy.sparse.linalg.lsqr(A, b)   # lsqr returns a tuple; the solution vector is out[0]

Two things make this practical. First, grad_weight=[2.5, 0.5, 0.5] weights the L (lightness) channel five times harder than the two color channels. Lightness carries the structure your eye locks onto, so the solver is told to preserve lightness gradients tightly and let color relax. Second, the big sparse matrix A depends only on image size and weights, not on pixel values, so it is built once and cached:

crt_states = (h, w, grad_weight)
if As is None or crt_states != prev_states:
    As = construct_A(*crt_states)
    prev_states = crt_states

Building A walks every pixel to wire up the gradient operators, which is slow. For a video you run poisson_fusion on hundreds of frames at the same resolution, so caching it across calls turns a per-frame cost into a one-time cost. That global cache is the difference between a restyle that finishes and one you abandon.
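The whole idea fits in one dimension, so here is a toy version: a 1D signal instead of a Lab image, a dense np.linalg.lstsq solve instead of scipy's sparse lsqr, and functools.lru_cache instead of the global As cache. Every name is hypothetical; it is a sketch of the technique, not the repo's poisson_fusion.

```python
from functools import lru_cache

import numpy as np


@lru_cache(maxsize=None)
def build_operator(n, grad_weight):
    """Weighted difference operator stacked on an identity. Built once per (n, weight).

    The cached array is shared between calls, so callers must not modify it.
    """
    diff = np.zeros((n, n))
    idx = np.arange(n - 1)
    diff[idx, idx] = 1.0
    diff[idx, idx + 1] = -1.0          # row k encodes x[k] - x[k + 1]; last row stays zero
    return np.vstack([grad_weight * diff, np.eye(n)])


def fuse_1d(blend, s1, s2, mask, grad_weight=2.5):
    """Least-squares fit to mixed gradients (s1 outside mask, s2 inside) and to blend."""
    n = len(blend)
    g = np.zeros(n)
    g[:-1] = (s1[:-1] - s1[1:]) * (1 - mask[:-1]) + (s2[:-1] - s2[1:]) * mask[:-1]
    A = build_operator(n, grad_weight)
    b = np.concatenate([grad_weight * g, blend])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x


s1 = np.linspace(0.0, 1.0, 20)         # "forward" pass
s2 = s1 + 0.3                          # "backward" pass: same slopes, offset tint
mask = np.zeros(20)
mask[10:] = 1.0
naive = np.where(mask > 0, s2, s1)     # hard paste: visible step at the seam
fused = fuse_1d(naive, s1, s2, mask)

naive_jump = abs(naive[10] - naive[9])
fused_jump = abs(fused[10] - fused[9])
print(fused_jump < naive_jump)         # True
```

The gradient rows ask for the ramp's slope everywhere, including across the seam, while the identity rows keep the result near the pasted values. The solver spreads the 0.3 offset over several samples instead of leaving it as one step. The second call to build_operator with the same arguments returns the cached matrix, which is the same trade the article's global cache makes.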

About the author. I'm Wlad Radchenko, a |

The interval is the dial that matters most. With fast motion or a moving camera, optical flow gets unreliable and a wide interval lets the warp smear. Drop the interval so keyframes are closer together and the propagation has less work to do per gap.

The flow list is N-1, not N. If you write your own warp loop against get_flow_sequence, remember there is one fewer flow than frame, and the temporal guide already accounts for it by special-casing i == 0.

Inpaint radius is forgiving here. The TELEA radius of 30 looks large, but it fills guide maps, not final pixels, so a soft fill is fine. The real frame quality comes from the synthesizer, the histogram match, and the Poisson solve downstream.

Watch the last gap. The final keyframe is almost never the last frame of the clip. The i + 1 > len(...) branch in every VideoSequence method exists to run that tail against the real last frame. If you reimplement the bookkeeping and skip it, your output will be a few frames short and the cut will be obvious.

Per-frame restyle flickers because the model re-decides the look on every frame. The fix is to decide rarely and propagate. Stylize sparse keyframes, warp them across the gaps with EbSynth-style positional and temporal guides, match colors in Lab with a double histogram transfer, and stitch the forward and backward halves with a gradient-domain Poisson solve that weights lightness heavily and caches its matrix. None of the temporal-coherence work is diffusion. It is flow, inpainting, and two classic blends, sequenced carefully.

The code is in visual_generation/restyle/blender/ in the Wunjo Make repo: video_sequence.py for the bookkeeping, guide.py for the guides, histogram_blend.py and poisson_fusion.py for the blends. If your own restyle boils, start by cutting the per-frame diffusion down to keyframes.

source & further reading

Original article on dev.to
