{"slug": "hacky-stablediffusion-code-for-generating-videos", "title": "hacky stablediffusion code for generating videos", "summary": "This is a Python script called `stablediffusionwalk.py` that creates hypnotic moving videos by smoothly walking randomly through Stable Diffusion's sample space. The script uses the Diffusers library and requires access to Stable Diffusion checkpoints from Hugging Face, along with various dependencies. Users can generate videos by running the script with a text prompt and then stitching the output images together using FFmpeg.", "body_md": "stablediffusionwalk.py\n\n\"\"\"\n\nstable diffusion dreaming\n\ncreates hypnotic moving videos by smoothly walking randomly through the sample space\n\nexample way to run this script:\n\n$ python stablediffusionwalk.py --prompt \"blueberry spaghetti\" --name blueberry\n\nto stitch together the images, e.g.:\n\n$ ffmpeg -r 10 -f image2 -s 512x512 -i blueberry/frame%06d.jpg -vcodec libx264 -crf 10 -pix_fmt yuv420p blueberry.mp4\n\nnice slerp def from @xsteenbrugge ty\n\nyou have to have access to stablediffusion checkpoints from https://huggingface.co/CompVis\n\nand install all the other dependencies (e.g. 
diffusers library)\n\n\"\"\"\n\nimport os\n\nimport inspect\n\nimport fire\n\nfrom diffusers import StableDiffusionPipeline\n\nfrom diffusers.schedulers import DDIMScheduler, LMSDiscreteScheduler, PNDMScheduler\n\nfrom time import time\n\nfrom PIL import Image\n\nfrom einops import rearrange\n\nimport numpy as np\n\nimport torch\n\nfrom torch import autocast\n\nfrom torchvision.utils import make_grid\n\n# -----------------------------------------------------------------------------\n\n@torch.no_grad()\n\ndef diffuse(\n\n        pipe,\n\n        cond_embeddings, # text conditioning, should be (1, 77, 768)\n\n        cond_latents,    # image conditioning, should be (1, 4, 64, 64)\n\n        num_inference_steps,\n\n        guidance_scale,\n\n        eta,\n\n    ):\n\n    torch_device = cond_latents.get_device()\n\n    # classifier-free guidance: add the unconditional embedding\n\n    max_length = cond_embeddings.shape[1] # 77\n\n    uncond_input = pipe.tokenizer([\"\"], padding=\"max_length\", max_length=max_length, return_tensors=\"pt\")\n\n    uncond_embeddings = pipe.text_encoder(uncond_input.input_ids.to(torch_device))[0]\n\n    text_embeddings = torch.cat([uncond_embeddings, cond_embeddings])\n\n    # if we use LMSDiscreteScheduler, let's make sure latents are multiplied by sigmas\n\n    if isinstance(pipe.scheduler, LMSDiscreteScheduler):\n\n        cond_latents = cond_latents * pipe.scheduler.sigmas[0]\n\n    # init the scheduler\n\n    accepts_offset = \"offset\" in set(inspect.signature(pipe.scheduler.set_timesteps).parameters.keys())\n\n    extra_set_kwargs = {}\n\n    if accepts_offset:\n\n        extra_set_kwargs[\"offset\"] = 1\n\n    pipe.scheduler.set_timesteps(num_inference_steps, **extra_set_kwargs)\n\n    # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature\n\n    # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.\n\n    # eta corresponds to η in DDIM paper: 
https://arxiv.org/abs/2010.02502\n\n    # and should be between [0, 1]\n\n    accepts_eta = \"eta\" in set(inspect.signature(pipe.scheduler.step).parameters.keys())\n\n    extra_step_kwargs = {}\n\n    if accepts_eta:\n\n        extra_step_kwargs[\"eta\"] = eta\n\n    # diffuse!\n\n    for i, t in enumerate(pipe.scheduler.timesteps):\n\n        # expand the latents for classifier free guidance\n\n        latent_model_input = torch.cat([cond_latents] * 2)\n\n        if isinstance(pipe.scheduler, LMSDiscreteScheduler):\n\n            sigma = pipe.scheduler.sigmas[i]\n\n            latent_model_input = latent_model_input / ((sigma**2 + 1) ** 0.5)\n\n        # predict the noise residual\n\n        noise_pred = pipe.unet(latent_model_input, t, encoder_hidden_states=text_embeddings)[\"sample\"]\n\n        # cfg\n\n        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)\n\n        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)\n\n        # compute the previous noisy sample x_t -> x_t-1\n\n        if isinstance(pipe.scheduler, LMSDiscreteScheduler):\n\n            cond_latents = pipe.scheduler.step(noise_pred, i, cond_latents, **extra_step_kwargs)[\"prev_sample\"]\n\n        else:\n\n            cond_latents = pipe.scheduler.step(noise_pred, t, cond_latents, **extra_step_kwargs)[\"prev_sample\"]\n\n    # scale and decode the image latents with vae\n\n    cond_latents = 1 / 0.18215 * cond_latents\n\n    image = pipe.vae.decode(cond_latents)\n\n    # generate output numpy image as uint8\n\n    image = (image / 2 + 0.5).clamp(0, 1)\n\n    image = image.cpu().permute(0, 2, 3, 1).numpy()\n\n    image = (image[0] * 255).astype(np.uint8)\n\n    return image\n\ndef slerp(t, v0, v1, DOT_THRESHOLD=0.9995):\n\n    \"\"\" helper function to spherically interpolate two arrays v0 v1 \"\"\"\n\n    # default to False so that numpy inputs do not hit an undefined name below\n\n    inputs_are_torch = False\n\n    if not isinstance(v0, np.ndarray):\n\n        inputs_are_torch = True\n\n        input_device = v0.device\n\n        v0 = v0.cpu().numpy()\n\n  
      v1 = v1.cpu().numpy()\n\n    dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))\n\n    if np.abs(dot) > DOT_THRESHOLD:\n\n        v2 = (1 - t) * v0 + t * v1\n\n    else:\n\n        theta_0 = np.arccos(dot)\n\n        sin_theta_0 = np.sin(theta_0)\n\n        theta_t = theta_0 * t\n\n        sin_theta_t = np.sin(theta_t)\n\n        s0 = np.sin(theta_0 - theta_t) / sin_theta_0\n\n        s1 = sin_theta_t / sin_theta_0\n\n        v2 = s0 * v0 + s1 * v1\n\n    if inputs_are_torch:\n\n        v2 = torch.from_numpy(v2).to(input_device)\n\n    return v2\n\ndef run(\n\n        # --------------------------------------\n\n        # args you probably want to change\n\n        prompt = \"blueberry spaghetti\", # prompt to dream about\n\n        gpu = 0, # id of the gpu to run on\n\n        name = 'blueberry', # name of this project, for the output directory\n\n        rootdir = '/home/ubuntu/dreams',\n\n        num_steps = 200, # number of steps between each pair of sampled points\n\n        max_frames = 10000, # number of frames to write and then exit the script\n\n        num_inference_steps = 50, # more (e.g. 100, 200 etc) can create slightly better images\n\n        guidance_scale = 7.5, # can depend on the prompt. 
usually somewhere between 3-10 is good\n\n        seed = 1337,\n\n        # --------------------------------------\n\n        # args you probably don't want to change\n\n        quality = 90, # for jpeg compression of the output images\n\n        eta = 0.0,\n\n        width = 512,\n\n        height = 512,\n\n        weights_path = \"/home/ubuntu/stable-diffusion-v1-3-diffusers\",\n\n        # --------------------------------------\n\n    ):\n\n    assert torch.cuda.is_available()\n\n    assert height % 8 == 0 and width % 8 == 0\n\n    torch.manual_seed(seed)\n\n    torch_device = f\"cuda:{gpu}\"\n\n    # init the output dir\n\n    outdir = os.path.join(rootdir, name)\n\n    os.makedirs(outdir, exist_ok=True)\n\n    # init all of the models and move them to a given GPU\n\n    lms = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule=\"scaled_linear\")\n\n    pipe = StableDiffusionPipeline.from_pretrained(weights_path, scheduler=lms, use_auth_token=True)\n\n    pipe.unet.to(torch_device)\n\n    pipe.vae.to(torch_device)\n\n    pipe.text_encoder.to(torch_device)\n\n    # get the conditional text embeddings based on the prompt\n\n    text_input = pipe.tokenizer(prompt, padding=\"max_length\", max_length=pipe.tokenizer.model_max_length, truncation=True, return_tensors=\"pt\")\n\n    cond_embeddings = pipe.text_encoder(text_input.input_ids.to(torch_device))[0] # shape [1, 77, 768]\n\n    # sample a source\n\n    init1 = torch.randn((1, pipe.unet.in_channels, height // 8, width // 8), device=torch_device)\n\n    # iterate the loop\n\n    frame_index = 0\n\n    while frame_index < max_frames:\n\n        # sample the destination\n\n        init2 = torch.randn((1, pipe.unet.in_channels, height // 8, width // 8), device=torch_device)\n\n        for i, t in enumerate(np.linspace(0, 1, num_steps)):\n\n            init = slerp(float(t), init1, init2)\n\n            print(\"dreaming... 
\", frame_index)\n\n            with autocast(\"cuda\"):\n\n                image = diffuse(pipe, cond_embeddings, init, num_inference_steps, guidance_scale, eta)\n\n            im = Image.fromarray(image)\n\n            outpath = os.path.join(outdir, 'frame%06d.jpg' % frame_index)\n\n            im.save(outpath, quality=quality)\n\n            frame_index += 1\n\n        init1 = init2\n\nif __name__ == '__main__':\n\n    fire.Fire(run)", "url": "https://wpnews.pro/news/hacky-stablediffusion-code-for-generating-videos", "canonical_source": "https://gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355", "published_at": "2022-08-16 00:44:01+00:00", "updated_at": "2026-05-23 01:04:52.643059+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "open-source", "developer-tools", "research"], "entities": ["StableDiffusion", "CompVis", "Hugging Face", "FFmpeg"], "alternates": {"html": "https://wpnews.pro/news/hacky-stablediffusion-code-for-generating-videos", "markdown": "https://wpnews.pro/news/hacky-stablediffusion-code-for-generating-videos.md", "text": "https://wpnews.pro/news/hacky-stablediffusion-code-for-generating-videos.txt", "jsonld": "https://wpnews.pro/news/hacky-stablediffusion-code-for-generating-videos.jsonld"}}