Automatic merging in git without conflicts (using word-by-word diff instead of line-by-line)
up vote
1
down vote
favorite
I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).
git-diff
can show me the diff between two commits not only per-line but also per word or per character. For example:
$ git diff --word-diff-regex=. HEAD HEAD~
If git-diff
can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge
detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff
tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.
In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.
What approach would you choose?
git git-merge git-diff git-conflict-resolution
add a comment |
up vote
1
down vote
favorite
I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).
git-diff
can show me the diff between two commits not only per-line but also per word or per character. For example:
$ git diff --word-diff-regex=. HEAD HEAD~
If git-diff
can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge
detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff
tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.
In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.
What approach would you choose?
git git-merge git-diff git-conflict-resolution
add a comment |
up vote
1
down vote
favorite
up vote
1
down vote
favorite
I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).
git-diff
can show me the diff between two commits not only per-line but also per word or per character. For example:
$ git diff --word-diff-regex=. HEAD HEAD~
If git-diff
can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge
detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff
tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.
In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.
What approach would you choose?
git git-merge git-diff git-conflict-resolution
I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).
git-diff
can show me the diff between two commits not only per-line but also per word or per character. For example:
$ git diff --word-diff-regex=. HEAD HEAD~
If git-diff
can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge
detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff
tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.
In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.
What approach would you choose?
git git-merge git-diff git-conflict-resolution
git git-merge git-diff git-conflict-resolution
edited Nov 21 at 22:58
asked Nov 21 at 22:40
Adam Libuša
445516
445516
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
up vote
1
down vote
What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.
In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.
If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.
The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.
Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record
, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).
(It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch
—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record
should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)
The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.
In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.
If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.
The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.
Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record
, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).
(It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch
—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record
should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)
The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)
add a comment |
up vote
1
down vote
What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.
In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.
If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.
The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.
Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record
, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).
(It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch
—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record
should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)
The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)
add a comment |
up vote
1
down vote
up vote
1
down vote
What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.
In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.
If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.
The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.
Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record
, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).
(It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch
—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record
should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)
The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)
What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.
In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.
If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.
The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.
Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record
, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).
(It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch
—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record
should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)
The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)
answered Nov 21 at 23:40
torek
178k16228303
178k16228303
add a comment |
add a comment |
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53421427%2fautomatic-merging-in-git-without-conflicts-using-word-by-word-diff-instead-of-l%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown