Meaning of buffer_size in Dataset.map, Dataset.prefetch and Dataset.shuffle












50














According to the TensorFlow documentation, the prefetch and map methods of the tf.contrib.data.Dataset class both have a buffer-size parameter.



For the prefetch method, the parameter is called buffer_size and, according to the documentation:




buffer_size: A tf.int64 scalar tf.Tensor, representing the maximum
number elements that will be buffered when prefetching.




For the map method, the parameter is called output_buffer_size and, according to the documentation:




output_buffer_size: (Optional.) A tf.int64 scalar tf.Tensor,
representing the maximum number of processed elements that will be
buffered.




Similarly, the shuffle method has a buffer_size parameter and, according to the documentation:




buffer_size: A tf.int64 scalar tf.Tensor, representing the number of
elements from this dataset from which the new dataset will sample.




What is the relation between these parameters?



Suppose I create a Dataset object as follows:



tr_data = TFRecordDataset(trainfilenames)
tr_data = tr_data.map(providefortraining, output_buffer_size=10 * trainbatchsize, num_parallel_calls=5)
tr_data = tr_data.shuffle(buffer_size=100 * trainbatchsize)
tr_data = tr_data.prefetch(buffer_size=10 * trainbatchsize)
tr_data = tr_data.batch(trainbatchsize)


What role do the buffer parameters play in the above snippet?










      tensorflow tensorflow-gpu tensorflow-datasets














      edited Dec 4 '17 at 17:02









      mrry











      asked Sep 27 '17 at 9:18









      Ujjwal

























          4 Answers


















          80














          TL;DR Despite their similar names, these arguments have quite different meanings. The buffer_size in Dataset.shuffle() can affect the randomness of your dataset, and hence the order in which elements are produced. The buffer_size in Dataset.prefetch() only affects the time it takes to produce the next element.





          The buffer_size argument in tf.data.Dataset.prefetch() and the output_buffer_size argument in tf.contrib.data.Dataset.map() provide a way to tune the performance of your input pipeline: both arguments tell TensorFlow to create a buffer of at most buffer_size elements, and a background thread to fill that buffer.
          (Note that we removed the output_buffer_size argument from Dataset.map() when it moved from tf.contrib.data to tf.data. New code should use Dataset.prefetch() after map() to get the same behavior.)



          Adding a prefetch buffer can improve performance by overlapping the preprocessing of data with downstream computation. Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.
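To make the idea of a bounded prefetch buffer concrete, here is a minimal pure-Python sketch. It is not how TensorFlow implements Dataset.prefetch(); the function name prefetch and the sentinel _DONE are made up for this illustration. It only shows the general pattern described above: a background thread fills a buffer of at most buffer_size elements while the consumer works on earlier ones, and the order of elements is unchanged.

```python
import queue
import threading

_DONE = object()  # hypothetical sentinel marking the end of the input


def prefetch(iterable, buffer_size):
    """Yield the elements of `iterable`, produced ahead of time by a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)  # holds at most buffer_size elements

    def producer():
        try:
            for element in iterable:
                buffer.put(element)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)  # always signal the end so the consumer cannot hang

    threading.Thread(target=producer, daemon=True).start()
    while True:
        element = buffer.get()  # blocks only if the producer has fallen behind
        if element is _DONE:
            return
        yield element


squares = list(prefetch((x * x for x in range(5)), buffer_size=2))
```

The buffer size only changes how far ahead the producer may run, not which elements come out or in what order, which is why a small buffer is often sufficient.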



          By contrast, the buffer_size argument to tf.data.Dataset.shuffle() affects the randomness of the transformation. We designed the Dataset.shuffle() transformation (like the tf.train.shuffle_batch() function that it replaces) to handle datasets that are too large to fit in memory. Instead of shuffling the entire dataset, it maintains a buffer of buffer_size elements, and randomly selects the next element from that buffer (replacing it with the next input element, if one is available). Changing the value of buffer_size affects how uniform the shuffling is: if buffer_size is greater than the number of elements in the dataset, you get a uniform shuffle; if it is 1 then you get no shuffling at all. For very large datasets, a typical "good enough" approach is to randomly shard the data into multiple files once before training, then shuffle the filenames uniformly, and then use a smaller shuffle buffer. However, the appropriate choice will depend on the exact nature of your training job.
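The buffer mechanism described in this paragraph can be sketched in a few lines of plain Python. This is only a simulation under the assumptions stated in the text (fill a buffer of buffer_size elements, emit a random one, replace it with the next input element); buffered_shuffle is a hypothetical helper, not a TensorFlow function.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in the order a fixed-size shuffle buffer would emit them."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Fill the buffer with the first buffer_size elements (or fewer if the input is short).
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a random buffered element and refill the freed slot from the input.
    for item in source:
        slot = rng.randrange(len(buffer))
        yield buffer[slot]
        buffer[slot] = item
    # Input exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


no_shuffle = list(buffered_shuffle(range(10), buffer_size=1))
partial = list(buffered_shuffle(range(100), buffer_size=10, seed=0))
```

With buffer_size=1 the input order is preserved, and with a buffer of 10 the element emitted at position i can only come from the first i + 10 inputs, so elements never move far forward from where they started.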
































          • After this explanation, I am still confused about tf.data.Dataset.shuffle(). I would like to know the exact shuffling process. Say, are the first batch_size samples randomly chosen from the first buffer_size elements, and so on?
            – Bs He
            Jul 10 '18 at 21:20





















          63














          Importance of buffer_size in shuffle()



          I wanted to follow up on the previous answer from @mrry to stress the importance of buffer_size in tf.data.Dataset.shuffle().



          Having a low buffer_size will not just give you inferior shuffling in some cases: it can mess up your whole training.





          A practical example: cat classifier



          Suppose for instance that you are training a cat classifier on images, and your data is organized in the following way (with 10000 images in each category):



          train/
              cat/
                  filename_00001.jpg
                  filename_00002.jpg
                  ...
              not_cat/
                  filename_10001.jpg
                  filename_10002.jpg
                  ...


          A standard way to input data with tf.data can be to have a list of filenames and a list of corresponding labels, and use tf.data.Dataset.from_tensor_slices() to create the dataset:



          filenames = ["filename_00001.jpg", "filename_00002.jpg", ..., 
          "filename_10001.jpg", "filename_10002.jpg", ...]
          labels = [1, 1, ..., 0, 0, ...]  # 1 for cat, 0 for not_cat

          dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
          dataset = dataset.shuffle(buffer_size=1000) # 1000 should be enough right?
          dataset = dataset.map(...) # transform to images, preprocess, repeat, batch...


          The big issue with the code above is that the dataset will not actually be shuffled in the right way. For about the first half of an epoch, we will only see cat images, and for the second half only non-cat images. This will hurt training a lot.

          At the beginning of training, the dataset will take the first 1000 filenames and put them in its buffer, then pick one at random among them. Since the first 1000 images are all cat images, we will only pick cat images at the beginning.



          The fix here is to make sure that buffer_size is at least 20000 (the total number of images), or to shuffle the filenames and labels in advance (using the same permutation for both, obviously).



          Since storing all the filenames and labels in memory is not an issue, we can actually use buffer_size = len(filenames) to make sure that everything will be shuffled together. Make sure to call tf.data.Dataset.shuffle() before applying the heavy transformations (like reading the images, processing them, batching...).



          dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
          dataset = dataset.shuffle(buffer_size=len(filenames))
          dataset = dataset.map(...) # transform to images, preprocess, repeat, batch...
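The other fix mentioned above, shuffling the filenames and labels in advance with the same indices, could look roughly like the following sketch. The helper name shuffle_together and the tiny six-element lists are made up for illustration; the point is only that both lists are permuted as pairs so that each filename keeps its label.

```python
import random


def shuffle_together(filenames, labels, seed=None):
    """Shuffle two parallel lists with one shared permutation."""
    pairs = list(zip(filenames, labels))
    random.Random(seed).shuffle(pairs)  # shuffles (filename, label) pairs as units
    shuffled_filenames = [filename for filename, _ in pairs]
    shuffled_labels = [label for _, label in pairs]
    return shuffled_filenames, shuffled_labels


# Hypothetical miniature version of the cat / not_cat listing.
filenames = ["filename_%05d.jpg" % i for i in (1, 2, 3, 10001, 10002, 10003)]
labels = [1, 1, 1, 0, 0, 0]

shuffled_filenames, shuffled_labels = shuffle_together(filenames, labels, seed=42)
```

The shuffled lists can then be passed to from_tensor_slices() in place of the sorted ones.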




          The takeaway is to always double check what the shuffling will do. A good way to catch these errors might be to plot the distribution of batches over time (make sure that batches contain about the same distribution as the training set, half cat and half non cat in our example).
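One way to check the distribution of batches without running TensorFlow at all is to simulate the shuffle buffer on the labels alone. The sketch below is a pure-Python approximation of the buffer behaviour described in this thread, not the library's implementation; buffered_shuffle and cat_fraction_per_batch are hypothetical helpers, and the batch size of 1000 is an arbitrary choice for the illustration.

```python
import random


def buffered_shuffle(items, buffer_size, rng):
    """Return items in the order a fixed-size shuffle buffer would emit them."""
    source = iter(items)
    buffer, output = [], []
    for item in source:  # fill the buffer first
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    for item in source:  # emit one element, refill the freed slot
        slot = rng.randrange(len(buffer))
        output.append(buffer[slot])
        buffer[slot] = item
    rng.shuffle(buffer)  # input exhausted: drain the remainder
    return output + buffer


def cat_fraction_per_batch(labels, batch_size):
    """Fraction of label 1 ("cat") in each consecutive batch."""
    return [
        sum(labels[start:start + batch_size]) / batch_size
        for start in range(0, len(labels), batch_size)
    ]


labels = [1] * 10000 + [0] * 10000  # sorted by class, as in the example
rng = random.Random(0)

small_buffer = cat_fraction_per_batch(buffered_shuffle(labels, 1000, rng), 1000)
full_buffer = cat_fraction_per_batch(buffered_shuffle(labels, len(labels), rng), 1000)

print("buffer_size=1000 :", small_buffer)
print("buffer_size=20000:", [round(fraction, 2) for fraction in full_buffer])
```

With buffer_size=1000 the first nine batches are guaranteed to contain only cats, because the element emitted at position i can only come from the first i + 1000 inputs. With the full-size buffer every batch should be close to half cats.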

























          • 3




            Thank you. This is a phenomenally clear answer :)
            – Ujjwal
            Jan 4 '18 at 18:15










          • Then how is the second sample chosen? Is it randomly chosen from the array [filename_01001, ..., filename_02000], or in some other way? Also, I don't understand why using cat images at the very beginning is problematic, and why the very first sampling is so important.
            – Bs He
            Jul 10 '18 at 21:32










          • The next sample is always chosen from the buffer (of size 1000 here). So the first sample is taken from the first 1000 filenames. The buffer decreases to size 999, so it takes the next input (filename_01001) and adds it. The second sample is taken randomly from these 1000 filenames (1001 first filenames minus the first sample).
            – Olivier Moindrot
            Jul 11 '18 at 9:07










          • The issue with this low buffer size is that you will only have cats in your first batches. So the model will trivially learn to predict only "cat". The best way to train the network is to have batches with the same amount of "cat" and "non cat".
            – Olivier Moindrot
            Jul 11 '18 at 9:08










          • Does TensorFlow have a direct way of plotting the distribution of batches?
            – Elona Mishmika
            Sep 11 '18 at 7:05



















          2














          As mentioned in another answer, @olivier-moindrot's answer is not correct.
          For example:



          import tensorflow as tf

          dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
          dataset = dataset.shuffle(buffer_size=2)
          dataset = dataset.batch(batch_size=1)
          iterator = dataset.make_initializable_iterator()
          next_element = iterator.get_next()

          init_op = iterator.initializer

          with tf.Session() as sess:
              sess.run(init_op)
              for i in range(10):
                  print(sess.run(next_element))


          and I got the following output:



          [1]
          [0]
          [3]
          [2]
          [4]
          [5]
          [7]
          [8]
          [9]
          [6]


          The key idea behind the buffer is to always keep buffer_size elements in memory. Once you randomly draw a sample (batch) from the buffer, the next input element is put into the buffer and you sample from the new buffer again.



          buffer: 0,1  get a sample [1]
          buffer: 0,2  get a sample [0]
          buffer: 2,3  get a sample [3]
          buffer: 2,4  get a sample [2]
          buffer: 4,5  get a sample [4]
          buffer: 5,6  get a sample [5]
          buffer: 6,7  get a sample [7]
          buffer: 6,8  get a sample [8]
          buffer: 6,9  get a sample [9]
          buffer: 6    get a sample [6]


































            1














            Actually the answer by @olivier-moindrot is not correct.



            You can verify this by creating the filenames and labels as he/she describes and printing the shuffled values.



            You will see that each shuffle step draws a sample at random from a set of elements whose size equals the buffer size.



            dataset = dataset.shuffle(buffer_size=1000)
            iterator = dataset.make_one_shot_iterator()
            next_element = iterator.get_next()
            with tf.Session() as sess:
                for i in range(1000):
                    print(sess.run(next_element))





              edited Nov 13 '17 at 8:59









              Pop











              answered Oct 30 '17 at 23:44









              mrry

              95.5k12269330




              95.5k12269330












              • For this explanation, I still have some confusions w.r.t tf.data.Dataset.shuffle(). I would like to know the exact shuffling process. Say, the first batch_size samples are randomly chosen from the first buffer_size elements, and so on.
                – Bs He
                Jul 10 '18 at 21:20




















              • For this explanation, I still have some confusions w.r.t tf.data.Dataset.shuffle(). I would like to know the exact shuffling process. Say, the first batch_size samples are randomly chosen from the first buffer_size elements, and so on.
                – Bs He
                Jul 10 '18 at 21:20


















              For this explanation, I still have some confusions w.r.t tf.data.Dataset.shuffle(). I would like to know the exact shuffling process. Say, the first batch_size samples are randomly chosen from the first buffer_size elements, and so on.
              – Bs He
              Jul 10 '18 at 21:20






              For this explanation, I still have some confusions w.r.t tf.data.Dataset.shuffle(). I would like to know the exact shuffling process. Say, the first batch_size samples are randomly chosen from the first buffer_size elements, and so on.
              – Bs He
              Jul 10 '18 at 21:20















              63














              Importance of buffer_size in shuffle()



              I wanted to follow up on the previous answer from @mrry to stress the importance of buffer_size in tf.data.Dataset.shuffle().



              Having a low buffer_size will not just give you inferior shuffling in some cases: it can mess up your whole training.





              A practical example: cat classifier



              Suppose for instance that you are training a cat classifier on images, and your data is organized in the following way (with 10000 images in each category):



              train/
              cat/
              filename_00001.jpg
              filename_00002.jpg
              ...
              not_cat/
              filename_10001.jpg
              filename_10002.jpg
              ...


              A standard way to input data with tf.data can be to have a list of filenames and a list of corresponding labels, and use tf.data.Dataset.from_tensor_slices() to create the dataset:



              filenames = ["filename_00001.jpg", "filename_00002.jpg", ..., 
              "filename_10001.jpg", "filename_10002.jpg", ...]
              labels = [1, 1, ..., 0, 0...] # 1 for cat, 0 for not_cat

              dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
              dataset = dataset.shuffle(buffer_size=1000) # 1000 should be enough right?
              dataset = dataset.map(...) # transform to images, preprocess, repeat, batch...


              The big issue with the code above is that the dataset will not actually be shuffled properly. For roughly the first half of an epoch we will see only cat images, and for the second half only non-cat images. This will hurt training a lot.

              At the beginning of training, the dataset takes the first 1000 filenames, puts them in its buffer, and picks one of them at random. Since the first 1000 images are all cat images, we will only pick cat images at the beginning.
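To make the buffer mechanics concrete, here is a minimal pure-Python sketch of a fixed-size shuffle buffer. It is a simplified model of the behaviour described above, not TensorFlow's actual implementation; the function name `shuffle_with_buffer` and the fixed seed are assumptions made for illustration.

```python
import itertools
import random

_END = object()  # sentinel marking an exhausted input


def shuffle_with_buffer(items, buffer_size, seed=0):
    """Yield items in an order a fixed-size shuffle buffer could produce."""
    rng = random.Random(seed)
    source = iter(items)
    # Initial fill: the first buffer_size elements of the input.
    buffer = list(itertools.islice(source, buffer_size))
    while buffer:
        idx = rng.randrange(len(buffer))
        picked = buffer[idx]
        replacement = next(source, _END)
        if replacement is _END:
            # Input exhausted: shrink the buffer by one element.
            buffer[idx] = buffer[-1]
            buffer.pop()
        else:
            # Refill the freed slot with the next input element.
            buffer[idx] = replacement
        yield picked


# 10000 cat labels followed by 10000 non-cat labels, as in the example.
labels = [1] * 10000 + [0] * 10000

small = list(shuffle_with_buffer(labels, buffer_size=1000))
full = list(shuffle_with_buffer(labels, buffer_size=len(labels)))

print("cat fraction in first 9000 outputs, buffer 1000: ", sum(small[:9000]) / 9000)
print("cat fraction in first 9000 outputs, full buffer:", sum(full[:9000]) / 9000)
```

In this model the element emitted at step i can only come from the first i + buffer_size inputs, which is why a small buffer cannot mix the two halves of a sorted dataset.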



              The fix here is to make sure that buffer_size is at least 20000 (the size of the dataset), or to shuffle filenames and labels in advance (using the same permutation for both, obviously).
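Shuffling the two lists in advance with the same indices could look like the sketch below. The helper name `shuffle_together`, the seed, and the generated filenames are hypothetical and only mirror the naming pattern of the example.

```python
import random


def shuffle_together(filenames, labels, seed=42):
    """Apply one random permutation to both lists so pairs stay aligned."""
    indices = list(range(len(filenames)))
    random.Random(seed).shuffle(indices)
    return [filenames[i] for i in indices], [labels[i] for i in indices]


# Hypothetical data following the example: 10000 cats, then 10000 non-cats.
filenames = ["filename_{:05d}.jpg".format(i) for i in range(1, 20001)]
labels = [1] * 10000 + [0] * 10000

shuffled_filenames, shuffled_labels = shuffle_together(filenames, labels)
print(shuffled_filenames[:3], shuffled_labels[:3])
```

The shuffled lists can then be passed to tf.data.Dataset.from_tensor_slices() in place of the sorted ones, so that even a small buffer_size starts from a mixed order.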



              Since storing all the filenames and labels in memory is not an issue, we can actually use buffer_size = len(filenames) to make sure that everything will be shuffled together. Make sure to call tf.data.Dataset.shuffle() before applying the heavy transformations (like reading the images, processing them, batching...).



              dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
              dataset = dataset.shuffle(buffer_size=len(filenames))
              dataset = dataset.map(...) # transform to images, preprocess, repeat, batch...




              The takeaway is to always double check what the shuffling will do. A good way to catch these errors might be to plot the distribution of batches over time (make sure that batches contain about the same distribution as the training set, half cat and half non cat in our example).
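As a rough sketch of that check, the fraction of cat labels per batch can be computed from the sequence of labels that the pipeline emits. The function name `cat_fraction_per_batch` and the two label sequences are assumptions for illustration; in practice the labels would be collected from the dataset iterator, and the resulting list could be plotted with any plotting library.

```python
def cat_fraction_per_batch(labels, batch_size):
    """Return the fraction of label 1 ("cat") in each consecutive batch."""
    fractions = []
    for start in range(0, len(labels), batch_size):
        batch = labels[start:start + batch_size]
        fractions.append(sum(batch) / len(batch))
    return fractions


# Label order with no effective shuffling: 10000 cats, then 10000 non-cats.
unshuffled = [1] * 10000 + [0] * 10000
# An idealised well-mixed order, used here only as a reference point.
well_mixed = [1, 0] * 10000

unshuffled_fractions = cat_fraction_per_batch(unshuffled, batch_size=1000)
well_mixed_fractions = cat_fraction_per_batch(well_mixed, batch_size=1000)

print(unshuffled_fractions)
print(well_mixed_fractions)
```

A healthy pipeline should give values that stay close to the class balance of the training set in every batch, rather than jumping from all cats to no cats.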

























              • 3




                Thank you. This is a phenomenally clear answer :)
                – Ujjwal
                Jan 4 '18 at 18:15










              • Then how is the second sample chosen? Is it chosen randomly from the array [filename_01001, ...filename_02000], or in some other way? Also, I don't understand why seeing cat images at the very beginning is problematic, and why the very first sampling is so important.
                – Bs He
                Jul 10 '18 at 21:32










              • The next sample is always chosen from the buffer (of size 1000 here). So the first sample is taken from the first 1000 filenames. The buffer decreases to size 999, so it takes the next input (filename_01001) and adds it. The second sample is taken randomly from these 1000 filenames (1001 first filenames minus the first sample).
                – Olivier Moindrot
                Jul 11 '18 at 9:07










              • The issue with this low buffer size is that you will only have cats in your first batches. So the model will trivially learn to predict only "cat". The best way to train the network is to have batches with the same amount of "cat" and "non cat".
                – Olivier Moindrot
                Jul 11 '18 at 9:08










              • Does TensorFlow have a direct way of plotting the distribution of batches?
                – Elona Mishmika
                Sep 11 '18 at 7:05
























              edited Apr 6 '18 at 14:27

























              answered Jan 4 '18 at 13:44









              Olivier Moindrot












              2














              As mentioned above, @olivier-moindrot's answer is not correct.
              For example:



              import tensorflow as tf

              # Ten integers, shuffled with a buffer that holds only two elements.
              dataset = tf.data.Dataset.from_tensor_slices([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
              dataset = dataset.shuffle(buffer_size=2)
              dataset = dataset.batch(batch_size=1)
              iterator = dataset.make_initializable_iterator()  # TF 1.x iterator API
              next_element = iterator.get_next()

              init_op = iterator.initializer

              with tf.Session() as sess:
                  sess.run(init_op)
                  for i in range(10):
                      print(sess.run(next_element))


              and I got the following output:



              [1]
              [0]
              [3]
              [2]
              [4]
              [5]
              [7]
              [8]
              [9]
              [6]


              The key idea behind the buffer is to always keep buffer_size elements in memory. Once you randomly draw a sample (batch) from the buffer, the next elements from the dataset are put into the buffer, and you sample from the new buffer again.



              buffer: 0,1, get a sample [1]
              buffer: 0,2, get a sample [0]
              buffer: 2,3, get a sample [3]
              buffer: 2,4, get a sample [2]
              buffer: 4,5, get a sample [4]
              buffer: 5,6, get a sample [5]
              buffer: 6,7, get a sample [7]
              buffer: 6,8, get a sample [8]
              buffer: 6,9, get a sample [9]
              buffer: 6, get a sample [6]
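The trace above can be replayed mechanically. The sketch below is a plain-Python model, not TensorFlow code, and the helper name `replay_shuffle` is hypothetical: it rebuilds the buffer contents before each emitted value and raises an error if a value could not have been in the buffer at that step. The outputs are written as flat integers rather than batches of size 1.

```python
_END = object()  # sentinel marking an exhausted input


def replay_shuffle(inputs, outputs, buffer_size):
    """Return the buffer contents seen just before each output was drawn."""
    source = iter(inputs)
    buffer = []
    states = []
    for out in outputs:
        # Top the buffer up to buffer_size while input elements remain.
        while len(buffer) < buffer_size:
            nxt = next(source, _END)
            if nxt is _END:
                break
            buffer.append(nxt)
        if out not in buffer:
            raise ValueError(
                "{!r} cannot be drawn from buffer {!r}".format(out, buffer)
            )
        states.append(sorted(buffer))
        buffer.remove(out)
    return states


observed = [1, 0, 3, 2, 4, 5, 7, 8, 9, 6]
states = replay_shuffle(range(10), observed, buffer_size=2)
for state, out in zip(states, observed):
    print("buffer:", state, "-> sample", out)
```

An order such as one starting with 5 would be rejected by this model, because 5 has not yet entered a buffer of size 2 at the first step.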















                  answered Nov 23 '18 at 2:09









                  Houtarou Oreki
























                      1














                      Actually, the answer by @olivier-moindrot is not correct.



                      You can verify this by creating the filenames and labels as he/she describes and printing the shuffled values.



                      You will see that each shuffle step draws a sample at random from a set of dataset elements whose size equals the buffer size.



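                      import tensorflow as tf

                      # `dataset` is assumed to be built from the filenames and labels described above.
                      dataset = dataset.shuffle(buffer_size=1000)
                      iterator = dataset.make_one_shot_iterator()  # TF 1.x iterator API
                      next_element = iterator.get_next()
                      with tf.Session() as sess:
                          for i in range(1000):
                              print(sess.run(next_element))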















                          answered Nov 7 '18 at 16:49









                          Isaac Cheng
