Automatic merging in git without conflicts (using word-by-word diff instead of line-by-line)











up vote
1
down vote

favorite












I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).



git-diff can show me the diff between two commits not only per-line but also per word or per character. For example:



$ git diff --word-diff-regex=. HEAD HEAD~


If git-diff can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.



In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.



What approach would you choose?










share|improve this question




























    up vote
    1
    down vote

    favorite












    I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).



    git-diff can show me the diff between two commits not only per-line but also per word or per character. For example:



    $ git diff --word-diff-regex=. HEAD HEAD~


    If git-diff can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.



    In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.



    What approach would you choose?










    share|improve this question


























      up vote
      1
      down vote

      favorite









      up vote
      1
      down vote

      favorite











      I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).



      git-diff can show me the diff between two commits not only per-line but also per word or per character. For example:



      $ git diff --word-diff-regex=. HEAD HEAD~


      If git-diff can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.



      In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.



      What approach would you choose?










      share|improve this question















      I’d like to automatically merge commits where each commit changes a different word on the same line. The goal is to use git as a document storage and access it programmatically (thus, ideally without having to resolve conflicts). In my use case, I know for sure that the changes will not overlap (they don't affect the same words, although the lie on the same line).



      git-diff can show me the diff between two commits not only per-line but also per word or per character. For example:



      $ git diff --word-diff-regex=. HEAD HEAD~


      If git-diff can identify the words (as opposed to the entire lines) that changed, I was convinced I could make git-merge detect conflicts on word-by-word (or character-by-character) basis. I was wrong. From what I understood (source), deep down, the git-diff tool operates on lines and the word or character diff functionality works already with these line-based results returned by git.



      In this answer, it is suggested to make use of the clean and smudge filters in order to store each word on a separate line in the snapshot. However, that seems to me to be too hacky.



      What approach would you choose?







      git git-merge git-diff git-conflict-resolution






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 21 at 22:58

























      asked Nov 21 at 22:40









      Adam Libuša

      445516




      445516
























          1 Answer
          1






          active

          oldest

          votes

















          up vote
          1
          down vote













          What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.



          In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.



          If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.



          The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.



          Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).



          (It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)



          The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)






          share|improve this answer





















            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














             

            draft saved


            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53421427%2fautomatic-merging-in-git-without-conflicts-using-word-by-word-diff-instead-of-l%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            1
            down vote













            What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.



            In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.



            If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.



            The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.



            Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).



            (It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)



            The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)






            share|improve this answer

























              up vote
              1
              down vote













              What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.



              In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.



              If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.



              The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.



              Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).



              (It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)



              The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)






              share|improve this answer























                up vote
                1
                down vote










                up vote
                1
                down vote









                What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.



                In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.



                If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.



                The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.



                Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).



                (It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)



                The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)






                share|improve this answer












                What you would need to do, to make Git work the way you wish, is to modify the merge code. In theory this is not too difficult. In practice, I am not sure how difficult it would turn out to be.



                In that other answer, I mention xdelta. More precisely, Git uses modified versions of both xdelta and libxdiff. The Git source puts most of this code in a subdirectory. Up one level you'll find a few more bits of code that work with the library, such as xdiff-interface.c.



                If you modified these to allow the xdiff code to treat "words" (separated, presumably, by any white space) rather than "lines" as individual symbols for the Myers, patience, and histogram algorithms, and modified the calling code similarly, you should be able to get Git to do the merge based on words instead of lines. (Git now adds an "anchor" thing that you might need to do something about; I have not looked at how this works.) You'd have to choose how to insert any conflict markers as well—presumably, around these white-space-separated words.



                The algorithms themselves are concerned with matching (or failing to match) symbols in two different inputs. Unfortunately, the symbols are, in libxdiff, always lines. The standard (non-Git-modified) libxdiff interface is documented here, and the interface itself is centered on whole files, with the libxdiff code doing its own line-breaking.



                Internally in the modified xdiff, it looks as though Git assigns each line to a "record" so that the symbols it is comparing are record-by-record. If you assigned each white-space-separated word to a record instead, you'd mostly get what you want, ignoring the small matter of (later) dealing with any actual white-space separating the actual records. That is, in xdl_hash_record, all you would do is stop at any white-space instead of at newline, then discard additional white space between this line and the next when finding the "next" record, to build the records themselves. The code calling this changed diff might have to change as it may assume that "record number" implies "line number" (this is not clear to me off-hand).



                (It might also work better if you included leading or trailing white space in each record, and just had the comparison function, xdl_recmatch—in the same file—say that the symbols match if they match excluding their whitespace. Note that xdl_hash_record should hash the symbol minus the whitespace as well: the diff engine requires that the hashes match if the symbols match, and for performance, wants the hashes to differ if the symbols differ. The test, in essence, is this: symbols S1 and S2 with hashes H1 and H2 match if H1 == H2 and recmatch(S1,S2) says that they match. The H1==H2 test eliminates a lot of subroutine calls to slow "compare"s when the symbols obviously differ, but for symbols whose hashes match, the call is required to verify that they're really the same.)



                The main Myers algorithm itself has time complexity O(ND) where N is the number of symbols and D is the number of differences—i.e., the length of the eventual edit-script—between the two input-sets. When the symbols are lines, a 1000-line file has 1000 symbols; when the symbols are, instead, words, the 1000-line file may have 30000 symbols. So this will obviously be slower, but at least it's generally linearly slower. The histogram and patience algorithms are modifications of Myers that should behave similarly, time-wise, I think, but I haven't really studied them properly.)







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 21 at 23:40









                torek

                178k16228303




                178k16228303






























                     

                    draft saved


                    draft discarded



















































                     


                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53421427%2fautomatic-merging-in-git-without-conflicts-using-word-by-word-diff-instead-of-l%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Trompette piccolo

                    Slow SSRS Report in dynamic grouping and multiple parameters

                    Simon Yates (cyclisme)