How to detect lines that are unique in large file using Reactive Extensions
I have to process large CSV files (up to tens of GB) that look like this:

Key,CompletedA,CompletedB
1,true,NULL
2,true,NULL
3,false,NULL
1,NULL,true
2,NULL,true

I have a parser that yields the parsed lines as an IEnumerable&lt;Record&gt;, so only one line at a time is read into memory.

Now I have to group the records by Key and check whether the columns CompletedA and CompletedB each have a value within the group. On the output I need the records whose group does not have both CompletedA and CompletedB set.

In this case, that is the record with key 3.

However, many similar analyses run over the same dataset, and I don't want to iterate over it multiple times.

I think I can convert the IEnumerable into an IObservable and use Reactive Extensions to find the records.

Is it possible to do this in a memory-efficient way with a simple LINQ expression over the IObservable collection?
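The single-pass, multiple-subscriber idea could be sketched with Rx.NET roughly as follows. This is only a sketch under assumptions: `Record` is the hypothetical parsed-row type from the question, the sample data stands in for the real parser output, and the `System.Reactive` NuGet package is required. `Publish` shares one enumeration of the cold source between any number of subscribed analyses.

```csharp
// Sketch only — Record is the hypothetical parsed-row type from the question;
// in the real program the parser would supply the IEnumerable<Record>.
using System;
using System.Collections.Generic;
using System.Reactive.Linq;

public record Record(int Key, bool? CompletedA, bool? CompletedB);

public static class RxSketch
{
    // Returns the keys whose group never sets both CompletedA and CompletedB.
    public static IList<int> FindIncomplete(IEnumerable<Record> parsed)
    {
        var incomplete = new List<int>();

        // Publish turns the cold sequence into one shared pass, so several
        // "analytics" can subscribe without re-reading the file.
        var shared = parsed.ToObservable().Publish();

        shared
            .GroupBy(r => r.Key)
            .SelectMany(g => g
                .Aggregate(0, (mask, r) =>
                    mask | (r.CompletedA != null ? 1 : 0)
                         | (r.CompletedB != null ? 2 : 0))
                .Where(mask => mask != 3)       // 3 == both A and B seen
                .Select(_ => g.Key))
            .Subscribe(incomplete.Add);

        // Other analyses would subscribe to `shared` here as well.

        shared.Connect();   // start the single enumeration of the source
        return incomplete;
    }

    public static void Main()
    {
        var sample = new[]
        {
            new Record(1, true, null),
            new Record(2, true, null),
            new Record(3, false, null),
            new Record(1, null, true),
            new Record(2, null, true),
        };
        Console.WriteLine(string.Join(", ", FindIncomplete(sample)));
    }
}
```

Note that, as with a dictionary-based loop, this keeps per-key state in memory (one group and one accumulator per distinct key), so memory grows with the number of distinct keys rather than with file size.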
  • Sure, you could also use a pipeline processor like Dataflow, or Reactive Extensions; however, this is all overkill. You can do it efficiently in a foreach loop, and you would be doing yourself a favor to try that first.
    – TheGeneral
    Nov 22 at 8:17

  • records.CountBy(z => new { Key = z.Key, Value = z.CompletedA ?? z.CompletedB }).Where(z => z.Value == 1).Select(z => z.Key) might get you started. You'll need nuget.org/packages/morelinq for this.
    – mjwills
    Nov 22 at 8:44

  • How many distinct Keys do you have?
    – Dmitry Bychenko
    Nov 22 at 8:56

  • @TheGeneral: This is just one of many such analyses, and I would have to do all of them in a single foreach. There are also other reasons why foreach is not suitable.
    – Liero
    Nov 22 at 9:18

  • @DmitryBychenko: "How many distinct Keys do you have?": half the number of records or more. Not sure how many lines there will be in production, but given the file size, a lot.
    – Liero
    Nov 22 at 9:22
c# system.reactive yield file-processing
asked Nov 22 at 8:13
Liero
2 Answers
Providing that Key is an integer, we can try using a Dictionary and one scan:

// value: 0b00 - neither A nor B
//        0b01 - A only
//        0b10 - B only
//        0b11 - both A and B
Dictionary<int, byte> status = new Dictionary<int, byte>();

var query = File
    .ReadLines(@"c:\MyFile.csv")
    .Where(line => !string.IsNullOrWhiteSpace(line))
    .Skip(1) // skip header
    .Select(line => YourParserHere(line));

foreach (var record in query) {
    int mask = (record.CompletedA != null ? 1 : 0) |
               (record.CompletedB != null ? 2 : 0);

    if (status.TryGetValue(record.Key, out var value))
        status[record.Key] = (byte)(value | mask);
    else
        status.Add(record.Key, (byte)mask);
}

// All keys that don't have value 3 == 0b11 (both A and B)
var keysWithoutBoth = status
    .Where(pair => pair.Value != 3)
    .Select(pair => pair.Key);

answered Nov 22 at 8:46 (edited Nov 22 at 8:59) – Dmitry Bychenko
  • The reason I asked about an Rx solution is that there would be too much stuff in a single foreach, so I need to split it somehow without enumerating the records multiple times; I thought a push collection with multiple subscribers would work. Moreover, I have different scenarios with different sets of "analytics". An Rx solution would make each single "analytic" a nicely reusable piece.
    – Liero
    Nov 22 at 9:27

  • @Liero - You would need to ensure that the IEnumerable&lt;Record&gt; is lazy to make Rx efficient, but if it is, then a simple loop will be too.
    – Enigmativity
    Nov 22 at 11:09
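The trade-off discussed here — many independent analyses, but only one enumeration — can also be handled without Rx by composing per-record callbacks and running them all inside one loop. A minimal sketch, assuming the question's hypothetical `Record` type and in-memory sample data in place of the real parser:

```csharp
// Sketch: several independent "analytics" fed by a single enumeration.
using System;
using System.Collections.Generic;

public record Record(int Key, bool? CompletedA, bool? CompletedB);

public static class MultiPass
{
    public static (Dictionary<int, byte> Status, long Total) Run(IEnumerable<Record> records)
    {
        var status = new Dictionary<int, byte>();
        long total = 0;

        // Each analysis is just an Action<Record>; all of them see every
        // record during one pass over the (lazy) source.
        var analyses = new List<Action<Record>>
        {
            r =>
            {
                int mask = (r.CompletedA != null ? 1 : 0) | (r.CompletedB != null ? 2 : 0);
                status[r.Key] = (byte)(status.TryGetValue(r.Key, out var v) ? v | mask : mask);
            },
            r => total++,          // a second, independent analysis
        };

        foreach (var record in records)        // one enumeration only
            foreach (var analysis in analyses)
                analysis(record);

        return (status, total);
    }

    public static void Main()
    {
        var sample = new[]
        {
            new Record(1, true, null),
            new Record(2, true, null),
            new Record(3, false, null),
            new Record(1, null, true),
            new Record(2, null, true),
        };
        var (status, total) = Run(sample);
        foreach (var pair in status)
            if (pair.Value != 3)
                Console.WriteLine($"Incomplete key: {pair.Key}");
        Console.WriteLine($"Total records: {total}");
    }
}
```

This keeps each analysis a separate, reusable piece while still reading the file once; the cost is that every analysis must be expressible as a fold over single records plus some private state.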
I think this will do what you need:

var result =
    source
        .GroupBy(x => x.Key)
        .SelectMany(xs =>
            (xs.Any(x => x.CompletedA == true) && xs.Any(x => x.CompletedB == true))
                ? new List<Record>()
                : xs.ToList());

Using Rx doesn't help here.

answered Nov 22 at 11:24 – Enigmativity
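Note that Enumerable.GroupBy materializes every group, so on a tens-of-GB file this buffers the whole dataset in memory. A quick check of the grouping logic on the sample data (with the second Any testing CompletedB, and `Record` as the question's hypothetical parsed-row type):

```csharp
// Sketch: the GroupBy/SelectMany filter applied to the question's sample rows.
using System;
using System.Collections.Generic;
using System.Linq;

public record Record(int Key, bool? CompletedA, bool? CompletedB);

public static class GroupCheck
{
    // Keeps the records of every group that does NOT have both
    // CompletedA and CompletedB set somewhere in the group.
    public static List<Record> Filter(IEnumerable<Record> source) =>
        source
            .GroupBy(x => x.Key)
            .SelectMany(xs =>
                xs.Any(x => x.CompletedA == true) && xs.Any(x => x.CompletedB == true)
                    ? Enumerable.Empty<Record>()
                    : xs)
            .ToList();

    public static void Main()
    {
        var sample = new[]
        {
            new Record(1, true, null),
            new Record(2, true, null),
            new Record(3, false, null),
            new Record(1, null, true),
            new Record(2, null, true),
        };
        // Only the key-3 record survives: its group never sets CompletedB.
        Console.WriteLine(string.Join(", ", GroupCheck.Filter(sample).Select(r => r.Key)));
    }
}
```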