@ChrisPenner ChrisPenner commented Oct 8, 2025

Overview

This represents a full overhaul of our diffing algorithm. The goal is to keep the two sides of a diff aligned even when lines are added or removed.

The old algorithm was entirely token-oriented and didn't special-case lines, so if a bunch of lines were deleted on one side, the other side's diff would get out of alignment.

The new algorithm inserts spacers where necessary to keep similar chunks in line.

Here's an example diff using the new algorithm.
Each line is marked with -/+ when changed, and changed tokens within changed lines are highlighted by wrapping them in {}. (Don't worry, Simon will make the rendered version prettier.)

=====left====
unchangedDefinition = 1
myDefinition : Nat -> Text -> Text
myDefinition n txt =
  -- Check for non-positive n
  if n == 0 then txt
  else Text.drop n txt

====right====
unchangedDefinition = 1
myNewDefinition : Int -> Text -> Text
myNewDefinition k input =
  -- Check for non-positive n
  if n <= 0
    then txt
  else
    Text.take n input

====diff====
  unchangedDefinition = 1                |   unchangedDefinition = 1
- {myDefinition} : {Nat} -> Text -> Text | + {myNewDefinition} : {Int} -> Text -> Text
- {myDefinition} {n} {txt} =             | + {myNewDefinition} {k} {input} =
    -- Check for non-positive n          |     -- Check for non-positive n
-   if n {==} 0 then txt                 | +   if n {<=} 0
-   else {Text.drop} n {txt}             | +  { }{ }{ }then txt
////                                     | +   else
////                                     | +   { }{ }{Text.take}{ }n {input}

Implementation notes

At a high level, the algorithm:

  • First runs the diff algorithm using entire lines as tokens. This gives us a list of equal lines, with chunks of changed lines in between.
  • Now we can map over the changed chunks and run a token-diff on each of those. This behaves similarly to the old algorithm.
  • We then compute the difference between the number of lines on the lhs and rhs sides of the diff, and add padding to fill out the shorter side.
  • This gives us two lists of lines, one for lhs, one for rhs. Each line is tagged with whether it's been changed or not. Lines with changes include annotations on each token indicating whether that particular token exists on the other side. This includes annotations for cases where the name is the same but the hash changed and vice versa.
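
The line-level pass and the padding step above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `Line`, `lcs`, `padChunk`, and `sideBySide` are hypothetical, lines are plain `String`s, the longest-common-subsequence helper is a naive stand-in for the real diff algorithm, padding is assumed to be applied per changed chunk, and the token-level diff inside changed chunks is left out.

```haskell
-- Hypothetical types and names for illustration; not the PR's actual code.
data Line = Unchanged String | Changed String | Spacer
  deriving (Eq, Show)

-- | Longest common subsequence of two lists. Naive and exponential in the
-- worst case, which is fine for a sketch but not for real inputs.
lcs :: Eq a => [a] -> [a] -> [a]
lcs [] _ = []
lcs _ [] = []
lcs (x : xs) (y : ys)
  | x == y = x : lcs xs ys
  | length skipLeft >= length skipRight = skipLeft
  | otherwise = skipRight
  where
    skipLeft = lcs xs (y : ys)
    skipRight = lcs (x : xs) ys

-- | Tag a chunk of changed lines and pad the shorter side with spacers
-- (assumption: padding is applied chunk by chunk).
padChunk :: [String] -> [String] -> ([Line], [Line])
padChunk l r = (pad l, pad r)
  where
    height = max (length l) (length r)
    pad xs = map Changed xs ++ replicate (height - length xs) Spacer

-- | Diff whole lines first, then pad every changed chunk so that equal
-- lines end up on the same row on both sides.
sideBySide :: [String] -> [String] -> ([Line], [Line])
sideBySide left right = go (lcs left right) left right
  where
    -- No common lines remain: whatever is left is one changed chunk.
    go [] l r = padChunk l r
    -- Everything before the next common line is a changed chunk.
    go (c : cs) l r =
      let (lChunk, lRest) = break (== c) l
          (rChunk, rRest) = break (== c) r
          (lPad, rPad) = padChunk lChunk rChunk
          (lTail, rTail) = go cs (drop 1 lRest) (drop 1 rRest)
       in (lPad ++ Unchanged c : lTail, rPad ++ Unchanged c : rTail)
```

Applied to `lines "one word\ntwo words\nthree words"` and `lines "one word\nmulti-line\ndifference\nshould add spacers\nthree words"`, the left side comes out as one unchanged line, one changed line, two spacers, and one unchanged line, so "three words" sits on the same row on both sides.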

This replaces the other diff algorithm entirely, which is currently unused in UCM, but will need matching updates from Simon when deploying to Share.

Test coverage

I added property tests that generate random diffs and assert that both sides always have matching line counts (which confirms that spacers are being added).
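
As a rough sketch of that invariant, the check below is written against any function that produces two sides. Everything here is an assumption made for illustration: `Line`, `padToMatch`, `sameLineCount`, and `smallDocs` are hypothetical names, `padToMatch` is a trivial stand-in for the real diff, and the inputs are enumerated exhaustively over small documents instead of being generated randomly, so the example needs nothing beyond `base`.

```haskell
import Control.Monad (replicateM)

-- | Hypothetical stand-in for a line on one side of the diff.
data Line = Content String | Spacer
  deriving (Eq, Show)

-- | Stand-in diff that only pads the shorter side with spacers.
-- Any real side-by-side diff could be substituted here.
padToMatch :: [String] -> [String] -> ([Line], [Line])
padToMatch l r = (pad l, pad r)
  where
    height = max (length l) (length r)
    pad xs = map Content xs ++ replicate (height - length xs) Spacer

-- | The invariant: both sides of the result have the same number of lines.
sameLineCount ::
  ([String] -> [String] -> ([Line], [Line])) -> [String] -> [String] -> Bool
sameLineCount diff l r =
  let (ls, rs) = diff l r
   in length ls == length rs

-- | Every document of up to three lines over a two-line alphabet.
smallDocs :: [[String]]
smallDocs = concatMap (\n -> replicateM n ["a", "b"]) [0 .. 3]

-- | Check the invariant for every pair of small documents.
invariantHolds :: Bool
invariantHolds =
  and [sameLineCount padToMatch l r | l <- smallDocs, r <- smallDocs]
```

A diff that forgets to add spacers fails the same check as soon as the two inputs have different lengths, which is the failure mode this property is meant to catch.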

I also added explicit test cases using a text rendering of diffs.
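
A text rendering along the lines of the example diff in the overview might look like the sketch below. It is an assumption about the general shape only: `Line`, `renderSide`, and `renderDiff` are hypothetical names, and the `{}` highlighting of individual changed tokens is omitted.

```haskell
-- Hypothetical types and names for illustration; not the PR's actual code.
data Line = Unchanged String | Changed String | Spacer
  deriving (Eq, Show)

-- | Render one line with its marker: two spaces for unchanged lines,
-- the given marker for changed lines, and "////" for spacers.
renderSide :: Char -> Line -> String
renderSide _ (Unchanged s) = "  " ++ s
renderSide marker (Changed s) = marker : ' ' : s
renderSide _ Spacer = "////"

-- | Render both sides as two columns separated by " | ".
-- Assumes both sides already have the same number of lines.
renderDiff :: [Line] -> [Line] -> String
renderDiff ls rs = unlines (zipWith row leftCol rightCol)
  where
    leftCol = map (renderSide '-') ls
    rightCol = map (renderSide '+') rs
    -- Pad the left column to the width of its longest line.
    width = maximum (0 : map length leftCol)
    row l r = l ++ replicate (width - length l) ' ' ++ " | " ++ r
```

Comparing rendered text like this against an expected string keeps test failures readable, which is harder to get from the JSON output.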

There are also transcript tests to show the output json, but those aren't good for evaluating the diff effectiveness.

Loose ends

There's still another pass we can do on this to be smarter about where within a change block we put spacers, e.g. look at this VS Code diff where spacers are interleaved within the change block to line up similar words.

[image: VS Code diff with spacers interleaved within the change block]

Pretty sure I can iterate on this to get the same, but this is already a significant enough change that it's worth it to ship.

@ChrisPenner ChrisPenner force-pushed the cp/side-by-side-diff-2 branch from 7cc0cc4 to fa3e015 Compare October 9, 2025 00:09
@@ -127,187 +127,436 @@ GET /api/projects/scratch/diff/terms?oldBranchRef=main&newBranchRef=new&oldTerm=
RESPONSE:
{
"diff": {
@ChrisPenner ChrisPenner Oct 9, 2025

@hojberg note changes to the diff response.

Now the diff has {left: <lines>, right: <lines>},
where each side is an array of lines.

Each line is one of:

{kind: changed
, value: [<difftokens>]
}
{kind: unchanged
, value: [<difftokens>]
}
{kind: spacer}
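
On the Haskell side, a shape like this could be modelled roughly as below. This is only a sketch under stated assumptions: `DiffLine`, `SideBySideDiff`, `lineKind`, `lineTokens`, and `sameHeight` are hypothetical names rather than the types in this PR, and the token type is left as a parameter because the token objects are not spelled out here.

```haskell
-- | One line on either side of the diff. The token type is a parameter
-- because the real token objects are not described in this comment.
data DiffLine tok
  = ChangedLine [tok]
  | UnchangedLine [tok]
  | SpacerLine
  deriving (Eq, Show)

-- | The two sides of the diff, mirroring {left: <lines>, right: <lines>}.
data SideBySideDiff tok = SideBySideDiff
  { leftLines :: [DiffLine tok]
  , rightLines :: [DiffLine tok]
  }
  deriving (Eq, Show)

-- | The value of the "kind" field for a line.
lineKind :: DiffLine tok -> String
lineKind (ChangedLine _) = "changed"
lineKind (UnchangedLine _) = "unchanged"
lineKind SpacerLine = "spacer"

-- | The tokens carried by a line; spacers carry none.
lineTokens :: DiffLine tok -> [tok]
lineTokens (ChangedLine ts) = ts
lineTokens (UnchangedLine ts) = ts
lineTokens SpacerLine = []

-- | Both sides should have the same number of lines once spacers are added.
sameHeight :: SideBySideDiff tok -> Bool
sameHeight d = length (leftLines d) == length (rightLines d)
```

Keeping the spacer as its own constructor with no payload means a consumer can render an empty row without inspecting any tokens.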

The tokens themselves are similar to before, but for now each token is split up into its own object rather than having a "both" entry with a list of tokens.

We may actually want to look at more efficient serializations since I know Mitchell had noticed the diff sizes were getting really big.

I'm also looking at IDs which associate tokens on the lhs with their rhs counterpart when they match; that way you could hover the lhs and have it light up the matching token on the other side if we want :)

hojberg commented Oct 9, 2025

@ChrisPenner how might I try this out with Share locally?

@ChrisPenner

> @ChrisPenner how might I try this out with Share locally?

@hojberg Here's a branch for ya!
unisoncomputing/share-api#155

@ChrisPenner ChrisPenner force-pushed the cp/side-by-side-diff-2 branch from ddba7a2 to 49adb61 Compare October 14, 2025 16:52
@ChrisPenner ChrisPenner marked this pull request as ready for review October 16, 2025 20:30
@ChrisPenner ChrisPenner requested a review from aryairani October 16, 2025 20:31
@ChrisPenner

@aryairani this is already up on Share and seems to be working well :)

Go ahead and merge unless you've got any concerns 👍🏼

Comment on lines +36 to +39
-- diffSyntaxText :: SyntaxText -> SyntaxText -> [SemanticSyntaxDiff Syntax.Element]
-- diffSyntaxText (AnnotatedText fromST) (AnnotatedText toST) =
-- diffSegments syntaxElementDiffEq fromST toST
-- & expandSpecialCases specialCaseAnnotations

Contributor

still want this?

Comment on lines +234 to +246
-- Helpers for testing semantic diffs on plaintext

-- simpleLeft :: Text
-- simpleLeft = "one word\ntwo words\nthree words"

-- simpleRight :: Text
-- simpleRight = "one word\ndifferent words\nthree words"

-- complexLeft :: Text
-- complexLeft = "one word\ntwo words\nthree words"

-- complexRight :: Text
-- complexRight = "one word\nmulti-line\ndifference\nshould add spacers\nthree words"

Contributor

and these

cabal-version: 1.12

- -- This file has been generated from package.yaml by hpack version 0.36.0.
+ -- This file has been generated from package.yaml by hpack version 0.38.1.

Contributor

i have a draft to have CI use the latest 'stack' but I haven't had a chance to test it. fingers crossed though

@aryairani aryairani merged commit fe78e65 into trunk Oct 17, 2025
32 checks passed
@aryairani aryairani deleted the cp/side-by-side-diff-2 branch October 17, 2025 03:33