DSA Application in Real Life: How Git Diff Works: LCS Intuition, Myers Algorithm, and Real Code Changes

The article explains that Git's `diff` functionality is rooted in the Longest Common Subsequence (LCS) problem: file content is treated as a sequence of lines in order to identify what has been added, removed, or left unchanged. It contrasts a naive line-by-line comparison with the LCS approach, demonstrating how LCS correctly identifies that only one line was inserted in a code example, and notes that Git's default implementation is the more efficient Myers diff algorithm rather than the classic O(m×n) dynamic programming solution.

## The Algorithm Hiding Behind `git diff`

You've run `git diff` hundreds of times. Red lines. Green lines. Done.

But have you ever stopped and asked: what algorithm is actually doing that?

It turns out the idea is closely related to one of the most classic problems in computer science: Longest Common Subsequence.

In this article, we'll explore how Git-style diffing works, why LCS is the right mental model, how the actual algorithm Git uses (Myers diff) connects to it, and what tradeoffs real tools make when choosing a diff algorithm.

This is the first article in my series "DSA Application in Real Life", where I explore how common data structures and algorithms power the tools developers use every day.

## The Problem Git Is Solving

Imagine we have an old version of a file:

```
function add(a, b) {
  return a + b;
}
```

Then we update it:

```
function addNumbers(a, b) {
  return a + b;
}
```

When we run `git diff`, Git shows:

```
-function add(a, b) {
+function addNumbers(a, b) {
   return a + b;
 }
```

This looks obvious to us as humans. Only the function name changed.

But Git does not "understand" JavaScript the way we do. At the diffing level, Git treats the file as a sequence of lines. Its job is to compare two sequences and decide:

- Which lines stayed the same?
- Which lines were deleted?
- Which lines were added?

This is a sequence comparison problem, and that's exactly where LCS comes in.

## Why Simple Line-by-Line Comparison Is Not Enough

A beginner might think Git just compares files line by line:

```
Old line 1 vs New line 1
Old line 2 vs New line 2
Old line 3 vs New line 3
```

This works only when changes happen at the same position. But real code changes are rarely that simple.
Consider this old file:

```
login
validate
save
logout
```

Now we insert one new line:

```
login
checkPermission
validate
save
logout
```

A naive line-by-line comparison would produce:

```
Old: login      New: login             same
Old: validate   New: checkPermission   different
Old: save       New: validate          different
Old: logout     New: save              different
Old: nothing    New: logout            added
```

That makes it look like almost the entire file changed, which is completely wrong. Only one line was added.

A smarter approach does not compare by position only. It first finds what stayed common between the two files. That is the LCS idea.

## LCS: The Mental Model Behind Diffing

LCS stands for Longest Common Subsequence.

A subsequence means you can pick elements from a sequence while keeping their relative order, but they do not need to be adjacent.

Example:

```
Old = A, B, C, D
New = A, C, E, D
```

The longest common subsequence is:

```
A, C, D
```

Because `A`, `C`, and `D` appear in both sequences in the same order.

Applied to file diffing, the lines of each file become the sequences:

```
Old = login, validate, save, logout
New = login, checkPermission, validate, save, logout
```

The LCS is:

```
login, validate, save, logout
```

Now Git-style diffing can reason:

- These lines are common → unchanged
- `checkPermission` is only in the new file → added

Result:

```
 login
+checkPermission
 validate
 save
 logout
```

That's the core idea.
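To make the contrast concrete, here is a minimal sketch of the naive positional comparison applied to the login example. The helper name `positional_compare` is made up for illustration; it is not something Git provides. The sketch only shows why comparing by position over-reports changes.

```python
from itertools import zip_longest


def positional_compare(old, new):
    """Compare two lists of lines strictly by position (the naive approach)."""
    report = []
    for old_line, new_line in zip_longest(old, new):
        if old_line == new_line:
            status = "same"
        elif old_line is None:
            status = "added"      # new file is longer at this position
        elif new_line is None:
            status = "deleted"    # old file is longer at this position
        else:
            status = "different"
        report.append((status, old_line, new_line))
    return report


old = ["login", "validate", "save", "logout"]
new = ["login", "checkPermission", "validate", "save", "logout"]

report = positional_compare(old, new)
for status, old_line, new_line in report:
    print(f"{status:<10} old={old_line!r:<12} new={new_line!r}")

# Count every position that is not reported as "same".
changed = sum(1 for status, _, _ in report if status != "same")
print(changed)  # 4 positions flagged, although only one line was inserted
```

One inserted line shifts every later line by one position, so the positional view flags four positions where the common-subsequence view flags one.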
## The Actual LCS Algorithm with Code

Here's the classic dynamic programming solution you've likely seen in competitive programming:

```python
def lcs_length(A, B):
    m, n = len(A), len(B)
    # dp[i][j] = LCS length of A[:i] and B[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    return dp[m][n]
```

For sequences:

```
A = A, B, C, D
B = A, C, E, D
```

The DP table looks like this:

```
     ""  A  C  E  D
""    0  0  0  0  0
A     0  1  1  1  1
B     0  1  1  1  1
C     0  1  2  2  2
D     0  1  2  2  3
```

The answer is:

```
dp[4][4] = 3
```

So the LCS length is 3, and the LCS is:

```
A, C, D
```

- Time complexity: O(m × n)
- Space complexity: O(m × n)

For large files, this gets expensive, which is why Git does not use textbook LCS directly.

## Reconstructing the LCS from the DP Table

The DP table gives us the length of the LCS. But to build an actual diff, we also need the common lines themselves.

We can get them by backtracking from the bottom-right corner of the table.

```python
def build_lcs(A, B):
    m, n = len(A), len(B)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Build DP table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Backtrack to reconstruct the actual LCS
    lcs = []
    i, j = m, n
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1]:
            lcs.append(A[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1

    return lcs[::-1]


old_file = ["A", "B", "C", "D"]
new_file = ["A", "C", "E", "D"]
print(build_lcs(old_file, new_file))
```

Output:

```
['A', 'C', 'D']
```

Now we do not only know the LCS length. We also know the actual common lines. That is what lets us decide which lines stayed unchanged, which lines were deleted, and which lines were added.
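The same backtracking walk can emit the diff directly instead of only the common lines. The sketch below is an illustration, not Git's implementation: `diff_lines` is a hypothetical helper, and its tie-breaking rule (which makes deletions print before insertions within a changed region) is an assumption chosen to resemble typical diff output.

```python
def diff_lines(old, new):
    """Return a list of (marker, line) pairs: ' ' unchanged, '-' deleted, '+' added."""
    m, n = len(old), len(new)

    # dp[i][j] = LCS length of old[:i] and new[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if old[i - 1] == new[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Walk back from the bottom-right corner, recording one operation per step.
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and old[i - 1] == new[j - 1]:
            ops.append((" ", old[i - 1]))   # line is part of the LCS
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]):
            ops.append(("+", new[j - 1]))   # line exists only in the new file
            j -= 1
        else:
            ops.append(("-", old[i - 1]))   # line exists only in the old file
            i -= 1

    ops.reverse()  # operations were collected back to front
    return ops


for marker, line in diff_lines(["A", "B", "C", "D"], ["A", "C", "E", "D"]):
    print(f"{marker}{line}")
```

For the `A, B, C, D` → `A, C, E, D` example this prints ` A`, `-B`, ` C`, `+E`, ` D`, one per line.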
## How LCS Builds the Diff

Once you know the LCS, building the diff becomes straightforward:

- Lines in the LCS → unchanged
- Lines in old but not in the LCS → deleted
- Lines in new but not in the LCS → added

Example:

```
Old = A, B, C, D
New = A, C, E, D
LCS = A, C, D
```

`B` is only in Old → deleted. `E` is only in New → added.

Diff output:

```
 A
-B
 C
+E
 D
```

This is the basic shape of what Git, GitHub pull requests, VS Code comparison, and merge tools show: unchanged lines, deleted lines, and added lines.

## Does Git Actually Use Textbook LCS?

Not directly.

Git's default algorithm is Myers diff, and it solves a slightly different but deeply related problem called the Shortest Edit Script.

The Shortest Edit Script asks:

> What is the smallest number of insertions and deletions needed to transform the old file into the new file?

LCS and Shortest Edit Script are closely connected.

- LCS asks: what is the longest structure that stayed the same?
- Shortest Edit Script asks: what is the smallest set of changes needed to transform one sequence into another?

When only insertions and deletions are allowed, minimizing the edit script is equivalent to maximizing the LCS length. For two sequences with lengths `m` and `n`:

```
edit distance = m + n - 2 × LCS length
```

For the example above, that is 4 + 4 - 2 × 3 = 2, which matches the two edits (delete `B`, insert `E`).

So yes, they are two sides of the same coin, but they approach the problem from different directions.

When we say "Git uses LCS-based diffing," the accurate meaning is:

> Git's diffing is based on sequence-comparison ideas rooted in LCS, but its default implementation uses Myers' shortest edit script algorithm, which is faster in practice.

## How Myers Diff Actually Works

Myers models the diff problem as a graph search.
Imagine a grid where:

- The X-axis represents lines of the old file
- The Y-axis represents lines of the new file
- Moving right means deleting a line from the old file
- Moving down means inserting a line from the new file
- Moving diagonally means the lines match, so no edit is needed

For:

```
Old = A, B, C, D
New = A, C, E, D
```

The matching positions are:

```
A matches A
C matches C
D matches D
```

A simplified grid looks like this:

```
              Old
         A    B    C    D
       ┌────┬────┬────┬────┐
New  A │ ╲  │    │    │    │
       ├────┼────┼────┼────┤
     C │    │    │ ╲  │    │
       ├────┼────┼────┼────┤
     E │    │    │    │    │
       ├────┼────┼────┼────┤
     D │    │    │    │ ╲  │
       └────┴────┴────┴────┘
```

The diagonal marks show where the two sequences have the same line.

But the important part is not only the matching cells. The important part is the path from the top-left corner to the bottom-right corner.

For this example, one shortest path is:

```
(0,0)
  │
  ├─ diagonal: A matches A
  │
(1,1)
  │
  ├─ right: delete B
  │
(2,1)
  │
  ├─ diagonal: C matches C
  │
(3,2)
  │
  ├─ down: insert E
  │
(3,3)
  │
  ├─ diagonal: D matches D
  │
(4,4)
```

So the path is:

```
diagonal → right → diagonal → down → diagonal
```

Which means:

```
Keep A
Delete B
Keep C
Insert E
Keep D
```

Now let's walk through that path step by step.

### Step 1: Match A

Both files start with `A`, so Myers can move diagonally.

```
Old: A B C D
New: A C E D
Match: A
```

No edit is needed.

### Step 2: Delete B

After `A`, the old file has `B`, but the new file has `C`. They do not match, so one shortest path deletes `B`.

```
-B
```

### Step 3: Match C

Now both sides line up at `C`, so Myers moves diagonally again.

```
Match: C
```

### Step 4: Insert E

After `C`, the new file has `E`, while the next line in the old file is `D`. They do not match, so Myers inserts `E`.

```
+E
```

### Step 5: Match D

Finally, both files match again at `D`.

The final shortest edit script is:

```
 A
-B
 C
+E
 D
```

In this example, the shortest edit script has only two edits:

```
Delete B
Insert E
```

So here, D = 2, where D is the number of edits (not the line named `D`).

That is the key idea behind Myers. It is not randomly comparing lines. It is searching the edit graph for the shortest path that converts one sequence into another.
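The search described above can be sketched in a few lines of Python. This is a simplified illustration of the greedy, furthest-reaching idea, not Git's actual code: the function name `myers_distance` is hypothetical, and the sketch returns only the number of edits D rather than the edit script itself (recovering the script would require keeping the state of each round).

```python
def myers_distance(old, new):
    """Length of the shortest edit script (insertions + deletions) from old to new."""
    n, m = len(old), len(new)

    # v[k] = furthest x reached so far on diagonal k, where k = x - y.
    # The seed v[1] = 0 lets round d = 0 start from the point (0, 0).
    v = {1: 0}

    for d in range(n + m + 1):              # d = number of edits used so far
        for k in range(-d, d + 1, 2):       # diagonals reachable with exactly d edits
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]                # step down: insert a line from new
            else:
                x = v[k - 1] + 1            # step right: delete a line from old
            y = x - k

            # Follow the diagonal for free while the lines match.
            while x < n and y < m and old[x] == new[y]:
                x += 1
                y += 1

            v[k] = x
            if x >= n and y >= m:           # reached the bottom-right corner
                return d

    return n + m                            # not reached: the loop always returns first


print(myers_distance(list("ABCD"), list("ACED")))  # 2: delete B, insert E
```

Each outer round allows one more edit, so the first round that reaches the bottom-right corner gives the minimum number of right and down moves.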
The path with the fewest right and down moves gives the shortest edit script. Diagonal moves are free because they represent lines that already match.

The algorithm is commonly described as:

```
Time complexity: O(ND)
```

Where:

- N is the total number of lines across both files
- D is the size of the shortest edit script

In simple words, D means how many insertions and deletions are needed to transform the old file into the new file.

Space complexity depends on the implementation:

- Computing only the edit distance D: O(N) for the single working array
- Recovering the edit script by storing the state of each search round: O(D²)
- Linear-space refinement (divide and conquer): O(N)

This is why Myers performs very well when two files are mostly similar, which is the common case in real codebases. Instead of comparing every possible pair of lines like textbook LCS DP, Myers focuses on finding a short path of edits between the two versions.

## Diff as an Edit Script

Let's walk through a concrete edit script:

```
Old file → A, B, C, D
New file → A, C, E, D

Step 1: Delete B            A, C, D
Step 2: Insert E after C    A, C, E, D

Edit script: Delete B, Insert E
```

That is just two operations.

Git-style diff output:

```
 A
-B
 C
+E
 D
```

Clean, minimal, and easy to understand.

## Why This Matters in Real Development

When we review code, we're not just looking at text changes. We're trying to understand intent.

A good diff makes that easy:

```
 function calculateTotal(items) {
-  return items.length;
+  return items.reduce((sum, item) => sum + item.price, 0);
 }
```

Any reviewer immediately understands: the old code counted items, and the new code sums their prices.

A bad diff creates noise and confusion. That's why diff algorithms matter. They are not only about correctness. They are also about readability.

## The Tradeoff: Shortest Diff vs Most Readable Diff

The smallest diff is not always the most readable one, especially in code with repeated patterns:

```
if (user) {
  return true;
}

if (admin) {
  return true;
}

if (owner) {
  return true;
}
```

When many lines look similar, a diff algorithm can match the wrong lines. The result may be technically correct, but hard to read.
That's why Git ships multiple diff algorithms.

## Git's Four Diff Algorithms

```
git diff --diff-algorithm=myers      # default
git diff --diff-algorithm=minimal
git diff --diff-algorithm=patience
git diff --diff-algorithm=histogram
```

Here's what each one does and when to use it.

### Myers

Fast and generally good. This is what runs when you just type:

```
git diff
```

Best for everyday use.

### Minimal

Tries harder to find the smallest possible diff. It can be slower, but useful when patch size matters.

### Patience

Prioritizes human readability. It matches unique lines first, which helps avoid false alignments on repeated code. Best for reviewing refactors or moved code blocks.

### Histogram

An evolution of Patience that also handles low-frequency lines well. It often produces readable output for real codebases. Some developers prefer setting it as their global default because it can make source code diffs easier to review.

To set Histogram as your global default:

```
git config --global diff.algorithm histogram
```

## Myers vs Patience Diff

Myers is very good at finding a short edit script. But sometimes the shortest diff is not the most readable diff.

This usually happens when a file has repeated or similar-looking lines. In that case, the algorithm may choose matches that are technically valid but not ideal for human review.

Consider this example.

Old version:

```python
def validate_user(user):
    if not user.email:
        return False
    return True

def save_user(user):
    database.save(user)

def validate_admin(admin):
    if not admin.email:
        return False
    return True
```

New version:

```python
def validate_admin(admin):
    if not admin.email:
        return False
    return True

def validate_user(user):
    if not user.email:
        return False
    return True

def save_user(user):
    database.save(user)
```

Here, `validate_admin` moved from the bottom to the top.

Because the functions contain repeated lines like:

```
return False
return True
```

a shortest-diff algorithm can sometimes align the repeated lines in a confusing way.
A Myers-style diff may produce a technically correct result like this:

```python
+def validate_admin(admin):
+    if not admin.email:
+        return False
+    return True
+
 def validate_user(user):
     if not user.email:
         return False
     return True

 def save_user(user):
     database.save(user)
-
-def validate_admin(admin):
-    if not admin.email:
-        return False
-    return True
```

This output is correct: it shows that `validate_admin` was added at the top and removed from the bottom.

But for a reviewer, the important idea is simpler:

> One function moved position.

Patience diff tries to make this kind of refactor easier to read by first looking for unique lines as anchors, such as:

```python
def validate_user(user):
def save_user(user):
def validate_admin(admin):
```

These unique lines help the algorithm avoid matching only repeated lines like `return False` and `return True`.

For this small example, Patience may still produce an output that looks very similar to Myers:

```python
+def validate_admin(admin):
+    if not admin.email:
+        return False
+    return True
+
 def validate_user(user):
     if not user.email:
         return False
     return True

 def save_user(user):
     database.save(user)
-
-def validate_admin(admin):
-    if not admin.email:
-        return False
-    return True
```

So the important point is not that Patience magically shows a special "move" operation. Git diffs are still usually represented as additions and deletions.

The real benefit of Patience appears more clearly in larger refactors, especially when a file has many repeated lines such as:

```
return False
return True
else:
break
continue
}
```

In those cases, Myers may match repeated lines too aggressively, while Patience prefers stronger unique anchors. That often makes the final diff easier for humans to review.

The exact output can vary depending on file context and Git version, but the idea is the same:

- Myers focuses on finding a short edit script.
- Patience focuses more on stable, unique anchors.
- The shortest diff is not always the clearest diff.
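The anchor idea can be illustrated with a small sketch that finds lines occurring exactly once in both versions of the example file. It is only the first step of a Patience-style approach, under simplifying assumptions: `unique_anchors` is a hypothetical helper, and a real implementation would go on to keep only the anchors whose order agrees in both files and then diff the regions between them.

```python
from collections import Counter


def unique_anchors(old, new):
    """Return (old_index, new_index, line) for lines occurring exactly once in each file."""
    old_counts, new_counts = Counter(old), Counter(new)
    # For a line that is unique in `new`, this maps it to its only position.
    new_index = {line: j for j, line in enumerate(new)}
    return [
        (i, new_index[line], line)
        for i, line in enumerate(old)
        if old_counts[line] == 1 and new_counts[line] == 1
    ]


old_lines = [
    "def validate_user(user):",
    "    if not user.email:",
    "        return False",
    "    return True",
    "",
    "def save_user(user):",
    "    database.save(user)",
    "",
    "def validate_admin(admin):",
    "    if not admin.email:",
    "        return False",
    "    return True",
]
new_lines = old_lines[8:] + [""] + old_lines[:7]  # validate_admin moved to the top

anchors = unique_anchors(old_lines, new_lines)
for old_i, new_j, line in anchors:
    print(f"old {old_i:>2} -> new {new_j:>2}: {line}")
```

Repeated lines such as `return True` and the blank separators never become anchors, which is why they cannot pull the alignment in a confusing direction.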
## Algorithm Complexity Summary

| Algorithm | Rough Idea | Best For |
|---|---|---|
| Textbook LCS DP | O(m × n) time and space | Learning the concept |
| Myers diff | O(ND) in the common description | Default everyday diffs |
| Minimal | Spends extra work to reduce diff size | Smaller patches |
| Patience | Uses unique lines as anchors | Refactors / moved blocks |
| Histogram | Extends Patience using low-frequency lines | Often readable code diffs |

Where:

- m = number of lines in the old file
- n = number of lines in the new file
- N = total number of lines across both files
- D = size of the shortest edit script

## Where the DSA Is Hiding

In competitive programming, LCS is a textbook DP problem. In the real world, the same idea appears in:

- Git diff
- GitHub pull request review
- VS Code file comparison
- Merge conflict resolution
- Google Docs version history
- Code review platforms
- Patch generation

The input changes (lines of code, words in a document, DOM nodes in a UI, events in a timeline), but the core question is always the same:

> What stayed the same, and what changed?

## A Real-World Developer Example

Old code:

```js
function createUser(name, email) {
  const user = { name, email };
  saveUser(user);
  return user;
}
```

New code:

```js
function createUser(name, email, role) {
  const user = { name, email, role };
  validateUser(user);
  saveUser(user);
  return user;
}
```

A well-tuned diff shows:

```
-function createUser(name, email) {
+function createUser(name, email, role) {
-  const user = { name, email };
+  const user = { name, email, role };
+  validateUser(user);
   saveUser(user);
   return user;
 }
```

Any reviewer immediately understands:

- a `role` parameter was added
- the role is stored on the user object
- validation was introduced before saving

That's the value of a good diff algorithm. It is not just computing differences. It is helping humans understand change.
## Why Git Usually Works at the Line Level

By default, Git usually presents diffs at the line level because source code is naturally organized line by line.

For example, if we change this line:

```js
-const total = price * quantity;
+const total = price * quantity * tax;
```

A character-level diff could say that only `* tax` was appended. That is more precise, but precision is not always the same as readability.

In real code reviews, developers usually care about which lines changed and how those changes affect the surrounding code. Character-level diffs can become noisy very quickly, especially when formatting, indentation, or multiple small edits happen in the same line.

That is why line-level diffing is a good default for most developer workflows. But Git still gives you more detailed options when you need them:

```
git diff --word-diff
```

This shows word-level changes inside modified lines.

The best algorithm is not always the most precise one. It is the one that gives the most useful output for the context.

## LCS vs Myers: The Mental Model

```
LCS:   Find the longest part that stayed the same.
Myers: Find the shortest set of changes to get from old to new.
```

LCS gives you the intuition. Myers gives Git an efficient practical algorithm.

When only insertions and deletions are allowed, these two views are mathematically connected:

```
edit distance = old length + new length - 2 × LCS length
```

So:

- If the LCS is long → fewer edits are needed
- If the LCS is short → more edits are needed

They measure the same underlying change from different directions.

## Why This Is a Great Example of DSA in Real Life

Many beginners ask:

> Where do we actually use DSA in real projects?

`git diff` is one of the best answers, because every developer runs it daily without thinking about it.

When you run `git diff`, you're using an algorithm. When you review a pull request on GitHub, you're using an algorithm. When you resolve merge conflicts, you're relying on algorithms that compare versions of files.
The algorithm is invisible behind a clean developer experience. That's what good engineering looks like: the user sees red and green lines, and behind it is a carefully designed algorithmic solution built on decades of computer science research.

That's the beauty of DSA. Not just for interviews. Inside the tools you use every day.

## Practical Commands to Try

```
# Try different algorithms on any repo
git diff --diff-algorithm=myers
git diff --diff-algorithm=patience
git diff --diff-algorithm=histogram
git diff --diff-algorithm=minimal

# Word-level diff, great for prose or config files
git diff --word-diff

# Set histogram as your permanent default
git config --global diff.algorithm histogram
```

## Final Thoughts

When you first learn LCS, it may look like just another dynamic programming problem. But the core idea is powerful:

> Find what stayed the same so we can understand what changed.

That simple idea appears everywhere. Git uses related sequence-comparison ideas to show file changes. Code review tools use similar techniques to help developers understand pull requests. Merge tools use them to combine work from different branches. Document editors use them to show version history.

So the next time you run:

```
git diff
```

remember that you are not just seeing red and green lines. You are seeing dynamic programming intuition, graph search, and decades of algorithmic research, all compressed into one everyday developer command.