Stopping the flicker when you restyle a video frame by frame

A developer describes a technique to eliminate flicker when applying diffusion-based restyling to video frames. The approach stylizes only a sparse set of keyframes and uses optical flow to warp those stylized frames to fill the gaps, ensuring temporal consistency. The implementation includes scene detection, keyframe selection, and a VideoSequence class for managing frame sequences.

Run a diffusion restyle on every frame of a clip, one frame at a time, and the still images look great. Then you play them back and the whole thing boils. Textures crawl, colors pulse, a brick wall shifts its grout lines every frame. The model did nothing wrong on any single frame. It just made a slightly different choice each time, and at 24 frames a second your eye reads those differences as flicker.

This is a walkthrough of the code that kills that flicker. The trick is to stop restyling every frame. Stylize a few frames, then carry the style to the rest by following the motion. The interesting part is the bookkeeping and the blending that make the carry invisible, so I will spend most of the post there.

[Image: Each frame is a separate sample from the model, so each frame lands in a slightly different place. Played in sequence, those small differences become flicker. Photo: Unsplash.]

A diffusion model starts from noise and walks toward an image. Two frames of a video that look almost identical to you are still two different starting points and two different walks. The model has no memory of what it drew last frame, so it picks a fresh interpretation of "oil painting" or "anime" each time. On a still you never notice. In motion you see the model changing its mind 24 times a second.

You can lower the denoise strength so the model stays close to the input, but then you barely restyle anything. You can feed the previous frame back in, which helps a little and drifts a lot. The cleaner answer is structural: restyle only a sparse set of frames, and fill the gaps by warping a real stylized frame into place. A warped frame cannot disagree with itself between neighbors, because it is the same pixels pushed along the motion. This is the idea behind Rerender A Video (Yang et al., SIGGRAPH Asia 2023) and behind EbSynth (Jamriška et al., ACM ToG 2019), and it is what the code below implements.

Keyframes come from scene detection plus a fixed interval.
Inside the scene detector:

```python
if interval:
    scene_frames.extend(range(start_frame, end_frame, interval))
    scene_frames.append(end_frame - 1)
```

The default interval is 10. So inside each detected scene you take every tenth frame, plus the last frame of the scene, as keyframes. Those are the only frames the diffusion model ever touches. Everything between two keyframes is going to be synthesized by warping, not by the model. Pick the interval too large and motion outruns the warp; pick it too small and you pay for diffusion you did not need. Ten is a reasonable middle for most footage.

Once you commit to "stylize keyframes, propagate the rest," you inherit a filing problem. For every gap between keyframe `i` and keyframe `i+1` you need the right input frames, the right output paths, the right optical-flow files, and the right guide images, in both the forward and backward direction. Get one index off and a frame lands in the wrong folder.

`VideoSequence` in `video_sequence.py` is the class that does this filing. It is constructed with the list of keyframes (here called `frame_files_with_interval`) and it makes one output subdirectory per keyframe:

```python
self.__frame_files_with_interval = [f for f in frame_files_with_interval if ".png" in f]
self.__n_seq = len(self.__frame_files_with_interval)
...
out_subdir = self.__get_out_subdir(frame_file)  # out_<keyframe-name>
```

The core method is `get_input_sequence`. Given a gap index `i`, it returns the list of input frame paths in that gap:

```python
def get_input_sequence(self, i, is_forward=True):
    if i + 1 > len(self.__frame_files_with_interval) - 1:
        # last gap: run from the final keyframe to the true last frame of the video
        last_input_frame = self.__input_frames[-1]
        last_interval_frame = self.__frame_files_with_interval[i]
        if last_input_frame == last_interval_frame:
            return None
        else:
            beg_id = int(last_interval_frame.split(".")[0])
            end_id = int(last_input_frame.split(".")[0])
    else:
        beg_id = self.get_sequence_beg_id(i)
        end_id = self.get_sequence_beg_id(i + 1)
    if is_forward:
        id_list = list(range(beg_id, end_id))
    else:
        id_list = list(range(end_id, beg_id, -1))
    return [
        os.path.join(self.__input_dir, self.__input_format % id)
        for id in id_list
        if self.__input_format % id in self.__input_frames
    ]
```

Two things to notice. First, `beg_id` and `end_id` come straight from the keyframe filenames, which are named by frame number (`%04d.jpg`). The filename is the index. That is why the whole class can do its math on `int(name.split(".")[0])` instead of carrying a separate table.

Second, the `is_forward` flag reverses the range. The pipeline propagates style from the left keyframe rightward, and from the right keyframe leftward, then meets in the middle. The backward pass needs the same frames in reverse, and this one flag gives both.

The same shape repeats for every artifact the propagation needs:

- `get_output_sequence` builds the destination paths inside the keyframe's out folder.
- `get_flow_sequence` builds `flow_f_%04d.npy` for forward and `flow_b_%04d.npy` for backward.
- `get_edge_sequence`, `get_temporal_sequence`, `get_pos_sequence` build the per-frame guide paths in the keyframe's tmp folder.

One detail worth flagging: the flow lists are one element shorter than the frame lists. There are N frames but only N-1 motions between them:

```python
if is_forward:
    id_list = list(range(beg_id, end_id - 1))      # forward flows: N-1
else:
    id_list = list(range(end_id, beg_id + 1, -1))  # backward flows: N-1
```

If you ever zip flows against frames and get an off-by-one, this is where it comes from. The class is built so the flow list and the warp loop line up.

The last gap is the awkward one. The final keyframe is rarely the literal last frame of the video, so the code special-cases it: when `i + 1` runs past the keyframe list, it uses `self.__input_frames[-1]`, the true last frame, as the end of the gap. Without that branch the tail of every clip would go unstyled.
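The keyframe and gap arithmetic can be sketched without any of the file handling. This is a standalone illustration, not the repo's code: `pick_keyframes`, `gap_frame_ids`, and `gap_flow_ids` are hypothetical helpers, `end_frame` is assumed to be exclusive, and skipping a duplicate last keyframe is my own choice for the sketch.

```python
def pick_keyframes(start_frame, end_frame, interval=10):
    """Every `interval`-th frame of a scene plus its last frame.

    Assumes start_frame < end_frame and that end_frame is exclusive.
    """
    keys = list(range(start_frame, end_frame, interval))
    if keys[-1] != end_frame - 1:  # sketch-only choice: do not list the last frame twice
        keys.append(end_frame - 1)
    return keys


def gap_frame_ids(beg_id, end_id, is_forward=True):
    """Frame ids visited while propagating across one gap (N ids)."""
    if is_forward:
        return list(range(beg_id, end_id))
    return list(range(end_id, beg_id, -1))


def gap_flow_ids(beg_id, end_id, is_forward=True):
    """Flow ids for the same gap: one fewer than the frame ids (N-1 ids)."""
    if is_forward:
        return list(range(beg_id, end_id - 1))
    return list(range(end_id, beg_id + 1, -1))


keys = pick_keyframes(0, 25, interval=10)  # one scene covering frames 0..24
for beg_id, end_id in zip(keys, keys[1:]):
    for is_forward in (True, False):
        frames = gap_frame_ids(beg_id, end_id, is_forward)
        flows = gap_flow_ids(beg_id, end_id, is_forward)
        direction = "forward" if is_forward else "backward"
        print(beg_id, end_id, direction, len(frames), "frames,", len(flows), "flows")
```

For a 25-frame scene at interval 10 the keyframes are 0, 10, 20, and 24, so the last gap is shorter than the others, and in every gap and direction the flow list has exactly one fewer entry than the frame list.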
Propagation here is done EbSynth-style: you give EbSynth a source stylized image and a set of guide channels, and it synthesizes the target frame so it matches the style of the source while respecting the guides. The guides live in `guide.py`. Each one answers a different question for the synthesizer.

The positional guide answers "where did this pixel come from?" It starts from a synthetic image where each pixel encodes its own coordinate as color, then warps that image along the optical flow, frame after frame:

```python
@staticmethod
def generate_first_img(H, W):
    Hs = np.linspace(0, 1, H)
    Ws = np.linspace(0, 1, W)
    i, j = np.meshgrid(Hs, Ws, indexing='ij')
    r = (i * 255).astype(np.uint8)  # row -> red
    g = (j * 255).astype(np.uint8)  # col -> green
    b = np.zeros(r.shape)
    return np.stack((b, g, r), 2)
```

Red is the row, green is the column. After you warp this map by the flow, a pixel's color tells you which original pixel ended up there. That is a dense, smooth correspondence field, which is exactly what a synthesizer wants so it does not invent new texture in moving regions.

The temporal guide answers "what did the previous stylized frame look like, moved to here?" It takes the previous stylized frame and warps it forward by the flow:

```python
def get_cmd(self, i, weight) -> str:
    if i == 0:
        warped_img = self.stylized_imgs[0]
    else:
        prev_img = cv2.imread(self.stylized_imgs[i - 1])
        warped_img = self.flow_calc.warp(prev_img, self.flows[i - 1], 'nearest').astype(np.uint8)
        warped_img = cv2.inpaint(warped_img, self.masks[i - 1], 30, cv2.INPAINT_TELEA)
        cv2.imwrite(self.imgs[i], warped_img)
    return super().get_cmd(i, weight)
```

This is the anti-flicker guide. It pushes the synthesizer to make frame `i` look like frame `i-1` carried along the motion, so the style stays put on a surface as it moves instead of re-rolling every frame.

Both warps leave holes. Where motion uncovers a region the camera could not see last frame, the warp has no data, and the optical-flow mask marks those pixels.
The fix is the same in both guides:

```python
cur_img = cv2.inpaint(cur_img, mask, 30, cv2.INPAINT_TELEA)
```

`cv2.INPAINT_TELEA` fills the disoccluded holes from their surroundings so the guide has no black gaps. A radius of 30 pixels is generous, which suits the smooth guide maps; you do not need sharp inpainting here, just plausible filler.

The other two guides are simpler. The edge guide runs a Laplacian-style filter so the synthesizer keeps structure aligned to the input:

```python
filter = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]])
res = cv2.filter2D(img, -1, filter)
```

And the color guide is just the raw frames, so the synthesizer has the original colors to refer to. Each guide carries a `-weight`, so you can dial how strongly the synthesizer listens to motion versus structure versus color.

Now you have two stylized versions of every in-between frame: one propagated forward from the left keyframe, one propagated backward from the right keyframe. They agree on geometry but rarely on color, because each picked up the tint of a different keyframe along the way. Stack them naively and you get a visible color step.

`histogram_blend.py` fixes the color before the seam gets stitched. It works in Lab color space, which separates lightness from color so you can match tone without muddying brightness:

```python
a = cv2.cvtColor(a, cv2.COLOR_BGR2Lab)
b = cv2.cvtColor(b, cv2.COLOR_BGR2Lab)

# normalize each to a common mean/std
t_mean_val = 0.5 * 256
t_std_val = (1 / 36) * 256
a = histogram_transform(a, a_mean, a_std, t_mean, t_std)
b = histogram_transform(b, b_mean, b_std, t_mean, t_std)

# average them, then re-key to the reference frame's statistics
ab = (a * weight1 + b * weight2 - t_mean_val) / 0.5 + t_mean_val
ab = histogram_transform(ab, ab_mean, ab_std, min_error_mean, min_error_std)
```

The shape is: push both images to the same neutral mean and standard deviation, average them, then push the average to the statistics of `min_error`, the frame the pipeline trusts most for this position.
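The normalize, average, re-key shape can be sketched with NumPy alone. This is an illustrative sketch, not the repo's `histogram_blend.py`: `match_mean_std` and `blend_and_rekey` are hypothetical names, the inputs are assumed to be `H x W x 3` arrays already converted to Lab, and the intermediate rescale around mid-gray is left out because the final re-keying normalizes it away.

```python
import numpy as np


def match_mean_std(img, target_mean, target_std, eps=1e-6):
    """Shift and scale each channel so its mean/std match the targets."""
    x = np.asarray(img, dtype=np.float64)
    mean = x.mean(axis=(0, 1))
    std = x.std(axis=(0, 1))
    return (x - mean) * (target_std / (std + eps)) + target_mean


def blend_and_rekey(a, b, ref, weight1=0.5, weight2=0.5):
    """Average two frames on neutral statistics, then adopt the stats of `ref`."""
    neutral_mean = np.full(3, 0.5 * 256)       # mid-gray
    neutral_std = np.full(3, (1 / 36) * 256)   # narrow common spread
    a_n = match_mean_std(a, neutral_mean, neutral_std)
    b_n = match_mean_std(b, neutral_mean, neutral_std)
    ab = a_n * weight1 + b_n * weight2
    ref = np.asarray(ref, dtype=np.float64)
    return match_mean_std(ab, ref.mean(axis=(0, 1)), ref.std(axis=(0, 1)))


# Toy data standing in for the forward pass, backward pass, and reference frame.
rng = np.random.default_rng(0)
forward = rng.uniform(0, 255, (16, 16, 3))
backward = rng.uniform(0, 255, (16, 16, 3))
reference = rng.uniform(0, 255, (16, 16, 3))
blended = blend_and_rekey(forward, backward, reference)
```

Whatever tint the two inputs carry, the blended result ends up with the per-channel mean and standard deviation of the reference, which is the property that removes the color step.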
The two odd constants, `t_mean_val` and `t_std_val`, are a target mean of `0.5 * 256` (mid-gray) and a target std of `(1 / 36) * 256`. They are just a stable common ground to average in; the final re-keying is what makes the result match a real frame rather than a washed-out midpoint. This is per-channel mean/std transfer, the same idea as classic Reinhard color transfer, done twice.

Color matching makes the two halves agree on average. It does not hide the actual boundary where forward meets backward. For that the pipeline pastes in the gradient domain, the same Poisson Image Editing idea from Pérez, Gangnet and Blake (SIGGRAPH 2003) that photo tools use to drop an object into a new background without a halo.

The principle: do not copy pixels, copy differences between pixels. Build a target gradient field by taking gradients from image 1 outside the mask and image 2 inside it, then solve for the image whose gradients match that field. Seams disappear because you never enforce an absolute pixel value at the boundary, only the slope across it.

```python
def poisson_fusion(blendI, I1, I2, mask, grad_weight=[2.5, 0.5, 0.5]):
    Iab = cv2.cvtColor(blendI, cv2.COLOR_BGR2LAB).astype(float)
    Ia = cv2.cvtColor(I1, cv2.COLOR_BGR2LAB).astype(float)
    Ib = cv2.cvtColor(I2, cv2.COLOR_BGR2LAB).astype(float)
    m = (mask > 0).astype(float)[:, :, np.newaxis]

    # excerpt: gx and gy are preallocated arrays with the same shape as Ia
    # gradient from I1 outside the mask, from I2 inside it
    gx[:-1] = (Ia[:-1] - Ia[1:]) * (1 - m[:-1]) + (Ib[:-1] - Ib[1:]) * m[:-1]
    gy[:, :-1] = (Ia[:, :-1] - Ia[:, 1:]) * (1 - m[:, :-1]) + (Ib[:, :-1] - Ib[:, 1:]) * m[:, :-1]
```

Then for each channel it solves a least-squares system `Ax = b`, where `A` stacks the gradient operators and an identity term, and `b` stacks the target gradients and the original intensities:

```python
A = As[i]
b = np.vstack([im_dx * weight, im_dy * weight, im])
out = scipy.sparse.linalg.lsqr(A, b)
```

Two things make this practical. First, `grad_weight=[2.5, 0.5, 0.5]` weights the L (lightness) channel five times harder than the two color channels.
Lightness carries the structure your eye locks onto, so the solver is told to preserve lightness gradients tightly and let color relax. Second, the big sparse matrix `A` depends only on image size and weights, not on pixel values, so it is built once and cached:

```python
crt_states = (h, w, grad_weight)
if As is None or crt_states != prev_states:
    As = construct_A(*crt_states)
    prev_states = crt_states
```

Building `A` walks every pixel to wire up the gradient operators, which is slow. For a video you run `poisson_fusion` on hundreds of frames at the same resolution, so caching it across calls turns a per-frame cost into a one-time cost. That global cache is the difference between a restyle that finishes and one you abandon.

About the author: I'm Wlad Radchenko.

- The interval is the dial that matters most. With fast motion or a moving camera, optical flow gets unreliable and a wide interval lets the warp smear. Drop the interval so keyframes are closer together and the propagation has less work to do per gap.
- The flow list is N-1, not N. If you write your own warp loop against `get_flow_sequence`, remember there is one fewer flow than frame, and the temporal guide already accounts for it by special-casing `i == 0`.
- Inpaint radius is forgiving here. The TELEA radius of 30 looks large, but it fills guide maps, not final pixels, so a soft fill is fine. The real frame quality comes from the synthesizer, the histogram match, and the Poisson solve downstream.
- Watch the last gap. The final keyframe is almost never the last frame of the clip. The `i + 1 > len(...)` branch in every `VideoSequence` method exists to run that tail against the real last frame. If you reimplement the bookkeeping and skip it, your output will be a few frames short and the cut will be obvious.

Per-frame restyle flickers because the model re-decides the look on every frame. The fix is to decide rarely and propagate.
Stylize sparse keyframes, warp them across the gaps with EbSynth-style positional and temporal guides, match colors in Lab with a double histogram transfer, and stitch the forward and backward halves with a gradient-domain Poisson solve that weights lightness heavily and caches its matrix. None of the temporal-coherence work is diffusion. It is flow, inpainting, and two classic blends, sequenced carefully.

The code is in `visual_generation/restyle/blender/` in the Wunjo Make repo (https://github.com/wladradchenko/wunjo.wladradchenko.ru): `video_sequence.py` for the bookkeeping, `guide.py` for the guides, `histogram_blend.py` and `poisson_fusion.py` for the blends. If your own restyle boils, start by cutting the per-frame diffusion down to keyframes.
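If you want to see the gradient-domain stitch in isolation before opening the repo, here is a minimal 1-D sketch. It is a toy under stated assumptions, not the repo's `poisson_fusion`: `stitch_1d` is a hypothetical name, it works on one row of numbers instead of a Lab image, it anchors intensities to a naive paste, and it swaps the cached sparse matrix and `scipy.sparse.linalg.lsqr` for a small dense solve with `np.linalg.lstsq`.

```python
import numpy as np


def stitch_1d(a, b, mask, grad_weight=2.5):
    """Blend two 1-D signals by matching gradients instead of copying values.

    Gradients come from `a` where mask == 0 and from `b` where mask == 1.
    An identity term keeps the result close to the naive paste overall.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    m = np.asarray(mask, dtype=float)
    n = a.size
    paste = a * (1 - m) + b * m  # naive cut-and-paste, used as the intensity anchor

    # Target gradient x[i] - x[i+1], picked per sample by the mask.
    g = (a[:-1] - a[1:]) * (1 - m[:-1]) + (b[:-1] - b[1:]) * m[:-1]

    # Difference operator D with (D @ x)[i] = x[i] - x[i+1].
    D = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    D[idx, idx] = 1.0
    D[idx, idx + 1] = -1.0

    # Stack weighted gradient rows on top of identity rows and solve Ax = rhs.
    A = np.vstack([grad_weight * D, np.eye(n)])
    rhs = np.concatenate([grad_weight * g, paste])
    x, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return x


# Two flat signals that disagree by a constant offset, joined in the middle.
left = np.zeros(20)
right = np.full(20, 10.0)
seam_mask = (np.arange(20) >= 10).astype(float)

stitched = stitch_1d(left, right, seam_mask)
print("naive seam jump:", right[10] - left[9])
print("stitched seam jump:", stitched[10] - stitched[9])
```

The naive paste jumps by the full offset at the seam. The solve spreads that disagreement over the neighboring samples because only the slope is enforced there, and a larger `grad_weight` spreads it further, which is the 1-D version of why the lightness channel gets the heaviest weight.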