Elasticsearch Reindexing race condition























Hello Elasticsearch users/experts,

I have a bit of trouble understanding the race condition problem with the Reindex API of Elasticsearch and would like to hear if anyone has found a solution to it.

I have searched in a lot of places and could not find any clear solution (most of the solutions date back to before the Reindex API existed).



As you might know, the (now) standard way of reindexing documents (after changing the mapping, for example) is to use an alias.
Suppose the alias points to "old_index". We create a new index called "new_index" with the new mapping, call the Reindex API to copy the documents from "old_index" to "new_index", and then switch the alias to point to "new_index" (removing the alias pointer to "old_index"). This seems to be the standard way of reindexing; it is what I have seen on almost all the recent websites I visited.



My questions about this method are the following. I do not want any downtime (users should still be able to search documents), and I still want to be able to ingest documents into Elasticsearch while the reindexing process is running:




  1. If documents keep arriving while the reindexing process is running (which will probably take a long time), how does the reindexing process ensure that each document is ingested into the old index (so that it is searchable during reindexing) and is still correctly reindexed into the new index?

  2. If a document is modified in the old index after it has been reindexed (copied to the new index), while the reindexing process is still running, how does Elasticsearch ensure that this modification is also taken into account in the new index?

  3. (Similar to 2.) If a record is deleted from the old index after it has been reindexed (copied to the new index), while the reindexing process is still running, how does Elasticsearch ensure that this removal is also taken into account in the new index?


Basically, in a scenario where no indexing mistake can be tolerated for any document, how should one proceed to make sure the reindexing goes through without any of the above problems?

Does anyone have any idea? And if there is no solution without downtime, how would we proceed with the least amount of downtime?



Thanks in advance!










      elasticsearch kibana
















      asked Nov 22 at 16:10









      WhileTrueContinue

























          1 Answer



          accepted










          Apologies if it's too verbose, but my two cents:




          If documents would still be incoming while the reindexing process is
          working (which would probably take a lot of time), how would the
          reindexing process ensure that the document would be ingested in the
          old index (to be able to search for it while the reindexing process is
          working) but still would be correctly reindexed to the new index?




          While reindexing from source to destination is in progress, the alias would (and must) still point to source_index. All modifications to this index happen independently of the reindexing, and these updates/deletes should take effect immediately.

          Let's say the state of source_index changes from t to t+1.

          If you ran a reindexing job at t to dest_index, it would still consume the snapshot of source_index as of t. You need to run the reindexing job again to get the latest data of source_index, i.e. the data at t+1, into dest_index.

          Ingestion into source_index and ingestion from source_index into dest_index are independent processes.

          Reindexing jobs do not guarantee consistency between source_index and dest_index.
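
To make the t versus t+1 point concrete, here is a minimal toy sketch. It does not use the Elasticsearch client at all: indices are plain Python dicts, and `reindex` and `snapshot_at_t` are hypothetical names chosen for illustration. It only shows why a copy taken from a point-in-time view misses later writes.

```python
# Toy in-memory model (an assumption for illustration), not the Elasticsearch client.
def reindex(snapshot, dest):
    """Copy every document of a point-in-time snapshot into dest."""
    for doc_id, doc in snapshot.items():
        dest[doc_id] = dict(doc)


source_index = {"1": {"title": "a"}, "2": {"title": "b"}}
dest_index = {}

# Time t: the reindex job takes its view of the source.
snapshot_at_t = {doc_id: dict(doc) for doc_id, doc in source_index.items()}

# Time t+1: writes keep arriving at the source while the job is running.
source_index["1"]["title"] = "a-updated"  # update
source_index["3"] = {"title": "c"}        # new document
del source_index["2"]                      # delete

reindex(snapshot_at_t, dest_index)

# dest_index reflects time t, not t+1.
stale_ids = sorted(i for i in dest_index if dest_index[i] != source_index.get(i))
missing_ids = sorted(i for i in source_index if i not in dest_index)
print("stale in dest:", stale_ids)
print("missing from dest:", missing_ids)
```

The stale list covers both the outdated copy of document 1 and the deleted document 2, and the missing list covers the new document 3, which matches the three cases in the question.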




          If a document is modified in the old index, after it has been
          reindexed (mapped to the new index), while the reindexing process is
          working, how would Elasticsearch ensure that this modification is also
          taken into account in the new index?




          It won't be taken into account in the new index, as reindexing uses the snapshot of source_index at time t.

          You would need to perform reindexing again. For this, the general approach is to have a scheduler that runs the reindexing process every few hours.

          You can have updates/deletes applied to source_index every few minutes (if you are using a scheduler) or in real time (if you are using an event-based approach).

          However, for full indexing (from source_index to dest_index), schedule it only once or twice a day, as it is an expensive process.
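
One possible way to keep the scheduled runs cheap is to restrict each run to recently changed documents. The sketch below only builds the request body for `POST _reindex` and sends nothing. It assumes every document carries a last-modified timestamp field, here hypothetically named `updated_at`; the helper name `build_incremental_reindex_body` is also an assumption. The `query` inside `source` and the `range` query with date math are standard Elasticsearch features.

```python
import json


def build_incremental_reindex_body(source, dest, since, timestamp_field="updated_at"):
    """Build a _reindex body that copies only documents changed since `since`.

    `timestamp_field` is a hypothetical field; the documents must actually
    store a last-modified timestamp for this to work.
    """
    return {
        "source": {
            "index": source,
            "query": {"range": {timestamp_field: {"gte": since}}},
        },
        "dest": {"index": dest},
    }


# "now-1h" is Elasticsearch date math: everything modified in the last hour.
body = build_incremental_reindex_body("source_index", "dest_index", "now-1h")
print(json.dumps(body, indent=2))
```

A filter like this picks up new and modified documents only; documents deleted from source_index never match the query, so it does nothing for deletes.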




          (Similar to 2.) If a record is deleted in the old index, after it has
          been reindexed (mapped to the new index), while the reindexing process
          is working, how would Elasticsearch ensure that this removal is also
          taken into account in the new index?




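          Again, you need to run a new job/reindexing process. Keep in mind that _reindex only copies documents that exist in source_index; a document that was already copied to dest_index is not removed by a rerun, so deletes have to be applied to dest_index separately.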
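
As a rough sketch of how deletes could be replayed, the helper below compares two lists of document IDs and produces a Bulk API payload (newline-delimited JSON) that deletes from dest_index whatever no longer exists in source_index. The function name and the way the ID lists are obtained are assumptions; in practice the IDs would come from scrolling both indices or from your own change log. Older Elasticsearch versions with mapping types may also need a `_type` in each action.

```python
import json


def bulk_delete_payload(dest_index, source_ids, dest_ids):
    """Return an NDJSON _bulk body deleting IDs present in dest but not in source."""
    stale = sorted(set(dest_ids) - set(source_ids))
    lines = [
        json.dumps({"delete": {"_index": dest_index, "_id": doc_id}})
        for doc_id in stale
    ]
    # The Bulk API requires every line, including the last, to end with a newline.
    return "".join(line + "\n" for line in lines)


# Document "2" was deleted from the source after it had been copied.
payload = bulk_delete_payload("dest_index", source_ids=["1", "3"], dest_ids=["1", "2", "3"])
print(payload, end="")
```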



          Version_type: External



          Just as a side note, one interesting thing you can do during reindexing is to use version_type: external, which ensures that only the updated/missing documents from source_index are reindexed into dest_index.



          You can refer to this LINK for more info on this



          POST _reindex
          {
            "source": {
              "index": "source_index"
            },
            "dest": {
              "index": "dest_index",
              "version_type": "external"
            }
          }
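
The rule behind version_type: external can be sketched with a toy model: a document is written to the destination only if it is missing there or if the source version is strictly higher. The code below is an in-memory illustration with hypothetical names, not the Elasticsearch client. One detail worth knowing: by default a version conflict aborts the whole reindex, so reruns over already-copied documents usually also set `"conflicts": "proceed"`, which counts conflicts and carries on, as in the request body at the end.

```python
def reindex_external(source, dest):
    """Toy model: source/dest map doc id -> (version, document).

    Returns (created, updated, conflicts), mimicking external versioning
    with conflicts set to proceed.
    """
    created = updated = conflicts = 0
    for doc_id, (version, doc) in source.items():
        if doc_id not in dest:
            dest[doc_id] = (version, doc)
            created += 1
        elif version > dest[doc_id][0]:
            dest[doc_id] = (version, doc)
            updated += 1
        else:
            conflicts += 1  # same or older version: skipped, not overwritten
    return created, updated, conflicts


source = {
    "1": (2, {"title": "a-updated"}),  # changed since the first run
    "2": (1, {"title": "b"}),          # unchanged
    "3": (1, {"title": "c"}),          # new since the first run
}
dest = {"1": (1, {"title": "a"}), "2": (1, {"title": "b"})}

result = reindex_external(source, dest)
print(result)

# Request body for POST _reindex that keeps going on version conflicts.
body = {
    "conflicts": "proceed",
    "source": {"index": "source_index"},
    "dest": {"index": "dest_index", "version_type": "external"},
}
```

Note that this still only creates and updates; a document deleted from source_index stays in dest_index.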
























          • Thanks for the detailed reply! Very nice hint about version_type external, I did not know that! Unfortunately, reindexing a lot of data would take a long time (maybe days), so I'm not sure a scheduler that runs multiple reindexing tasks is a good idea, as it would slow down the platform for the whole duration. I will wait a day or two for other people to answer as well, but if no one else answers I'll accept your answer.
            – WhileTrueContinue
            Nov 23 at 7:57








          • Sure. Sorry if my explanation wasn't crystal clear. You should use schedulers only for incremental updates, not for full indexing. What we do is have multiple jobs scheduled at different times for incremental updates, but we rarely do full indexing (though we still have a job for it). We have some 30-40 different sources that we ingest into various indexes, and we schedule the jobs so that no more than two or three incremental jobs run at a time.
            – Kamal
            Nov 23 at 8:23










          • I see, thanks for the reply! Your explanation was clear. An incremental update seems like a good idea, as full reindexing would be very expensive.
            – WhileTrueContinue
            Nov 23 at 8:29






          • Just to add one more note: if you use a scheduler, incremental updates happen in a pull fashion, i.e. the schedulers are the ones extracting/pulling the updates. We are thinking of moving to an event-based approach; however, that depends on the source content and which team manages it. If you have control over the source data (unfortunately we don't), I'd suggest implementing a message queue so that any updates/events are carried to Elasticsearch in real time (the source as publisher and Elasticsearch as consumer), which would eliminate the need for schedulers.
            – Kamal
            Nov 23 at 8:35











          Your Answer






          StackExchange.ifUsing("editor", function () {
          StackExchange.using("externalEditor", function () {
          StackExchange.using("snippets", function () {
          StackExchange.snippets.init();
          });
          });
          }, "code-snippets");

          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "1"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53434767%2felasticsearch-reindexing-race-condition%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes








          up vote
          2
          down vote



          accepted










          Apologies if its too verbose, but my two cents:




          If documents would still be incoming while the reindexing process is
          working (which would probably take a lot of time), how would the
          reindexing process ensure that the document would be ingested in the
          old index (to be able to search for it while the reindexing process is
          working) but still would be correctly reindexed to the new index?




          When a reindexing is happening from source to destination, the alias would(and must be) still be pointed to the source_index. All the modifications/changes to this index happens in independent fashion and these updates/deletes should be affecting immediately.



          Let's say the state of source_index changes from t to t+1



          If you have ran a reindexing job at t to dest_index, it would still consume the data of snapshot of source_index at t. You need to run reindexing job again to have latest data of source_index i.e. data at t+1 in your dest_index.



          Ingestions at source_index and ingestions from source_index to destination_index are both independent transactions/processes.



          Reindexing jobs will never always guarantee consistency between source_index and dest_index.




          If a document is modified in the old index, after it has been
          reindexed (mapped to the new index), while the reindexing process is
          working, how would ElasticSearch ensure that this modification is also
          taken account in the new index?




          It won't be taken account in the new index as reindexing would be making use of snapshot of source_index at time t.



          You would need to perform reindexing again. For this general approach would be to have a scheduler which keeps running reindexing process every few hours.



          You can have updates/deletes happening at source_index every few minutes(if you are using scheduler) or in real time(if you are using any event based approach).



          However for full indexing (from source_index to dest_index), have it scheduled like once in a day or twice as it is an expensive process.




          (Similar to 2.) If a record is deleted in the old index, after it has
          been reindexed (mapped to the new index), while the reindexing process
          is working, how would ElasticSearch ensure that this removal is also
          taken account in the new index?




          Again, you need to run a new job/reindexing process.



          Version_type: External



          Just as a side note, one interesting what you can do during reindexing, is to make use of the version_type:external which would ensure only the updated/missing documents from source_index would be reindexed in dest_index



          You can refer to this LINK for more info on this



          POST _reindex
          {
          "source": {
          "index": "source_index"
          },
          "dest": {
          "index": "dest_index",
          "version_type": "external"
          }
          }





          share|improve this answer



















          • 1




            Thanks for the detailed reply! Very nice hint for the version_type external, I did not know that! Unfortunately reindexing a lot of data would take a long time (maybe days), so I'm not sure if a scheduler is a good idea to run multiple reindexing tasks as it would slow down the platform for the whole duration? I will wait a day or two for other people to answer as well, but if noone else answers I'll accept your answer.
            – WhileTrueContinue
            Nov 23 at 7:57








          • 1




            Sure. Sorry if my explanation wasn't crystal, well you should be using schedulers only for incremental updates, not for full indexing. What we do is have multiple jobs scheduled at different times for incremental updates but we rarely do full indexing(yet we still have a job to do full indexing). We've got like 30-40 different sources where we ingest in various indexes, but we ensure that the jobs are scheduled in such a way that not more than two or three incremental jobs would be running at a time.
            – Kamal
            Nov 23 at 8:23










          • I see, thanks for the reply! Your explanation was clear, incremental update seems like a good idea as full reindexing would be very consuming.
            – WhileTrueContinue
            Nov 23 at 8:29






          • 1




            Just to add one more note, if you use scheduler, incremental updates would happen in a pull fashion i.e. schedulers would be the one extracting/pulling the updates. We are thinking to go in event based approach however that depends on the source content and which team manages it. If you have a control on source data(unfortunately we don't), I'd suggest have the messaging queue implemented so that any updates/events would be carried to elasticsearch in real time (source-as publisher and elasticsearch as consumers), which would eliminate the need for schedulers.
            – Kamal
            Nov 23 at 8:35















          up vote
          2
          down vote



          accepted










          Apologies if its too verbose, but my two cents:




          If documents would still be incoming while the reindexing process is
          working (which would probably take a lot of time), how would the
          reindexing process ensure that the document would be ingested in the
          old index (to be able to search for it while the reindexing process is
          working) but still would be correctly reindexed to the new index?




          When a reindexing is happening from source to destination, the alias would(and must be) still be pointed to the source_index. All the modifications/changes to this index happens in independent fashion and these updates/deletes should be affecting immediately.



          Let's say the state of source_index changes from t to t+1



          If you have ran a reindexing job at t to dest_index, it would still consume the data of snapshot of source_index at t. You need to run reindexing job again to have latest data of source_index i.e. data at t+1 in your dest_index.



          Ingestions at source_index and ingestions from source_index to destination_index are both independent transactions/processes.



          Reindexing jobs will never always guarantee consistency between source_index and dest_index.




          If a document is modified in the old index, after it has been
          reindexed (mapped to the new index), while the reindexing process is
          working, how would ElasticSearch ensure that this modification is also
          taken account in the new index?




          It won't be taken account in the new index as reindexing would be making use of snapshot of source_index at time t.



          You would need to perform reindexing again. For this general approach would be to have a scheduler which keeps running reindexing process every few hours.



          You can have updates/deletes happening at source_index every few minutes(if you are using scheduler) or in real time(if you are using any event based approach).



          However for full indexing (from source_index to dest_index), have it scheduled like once in a day or twice as it is an expensive process.




          (Similar to 2.) If a record is deleted in the old index, after it has
          been reindexed (mapped to the new index), while the reindexing process
          is working, how would ElasticSearch ensure that this removal is also
          taken account in the new index?




          Again, you need to run a new job/reindexing process.



          Version_type: External



          Just as a side note, one interesting what you can do during reindexing, is to make use of the version_type:external which would ensure only the updated/missing documents from source_index would be reindexed in dest_index



          You can refer to this LINK for more info on this



          POST _reindex
          {
          "source": {
          "index": "source_index"
          },
          "dest": {
          "index": "dest_index",
          "version_type": "external"
          }
          }





          share|improve this answer



















          • 1




            Thanks for the detailed reply! Very nice hint for the version_type external, I did not know that! Unfortunately reindexing a lot of data would take a long time (maybe days), so I'm not sure if a scheduler is a good idea to run multiple reindexing tasks as it would slow down the platform for the whole duration? I will wait a day or two for other people to answer as well, but if noone else answers I'll accept your answer.
            – WhileTrueContinue
            Nov 23 at 7:57








          • 1




            Sure. Sorry if my explanation wasn't crystal, well you should be using schedulers only for incremental updates, not for full indexing. What we do is have multiple jobs scheduled at different times for incremental updates but we rarely do full indexing(yet we still have a job to do full indexing). We've got like 30-40 different sources where we ingest in various indexes, but we ensure that the jobs are scheduled in such a way that not more than two or three incremental jobs would be running at a time.
            – Kamal
            Nov 23 at 8:23










          • I see, thanks for the reply! Your explanation was clear, incremental update seems like a good idea as full reindexing would be very consuming.
            – WhileTrueContinue
            Nov 23 at 8:29






          • 1




            Just to add one more note, if you use scheduler, incremental updates would happen in a pull fashion i.e. schedulers would be the one extracting/pulling the updates. We are thinking to go in event based approach however that depends on the source content and which team manages it. If you have a control on source data(unfortunately we don't), I'd suggest have the messaging queue implemented so that any updates/events would be carried to elasticsearch in real time (source-as publisher and elasticsearch as consumers), which would eliminate the need for schedulers.
            – Kamal
            Nov 23 at 8:35













          up vote
          2
          down vote



          accepted







          up vote
          2
          down vote



          accepted






          Apologies if its too verbose, but my two cents:




          If documents would still be incoming while the reindexing process is
          working (which would probably take a lot of time), how would the
          reindexing process ensure that the document would be ingested in the
          old index (to be able to search for it while the reindexing process is
          working) but still would be correctly reindexed to the new index?




          When a reindexing is happening from source to destination, the alias would(and must be) still be pointed to the source_index. All the modifications/changes to this index happens in independent fashion and these updates/deletes should be affecting immediately.



          Let's say the state of source_index changes from t to t+1



          If you have ran a reindexing job at t to dest_index, it would still consume the data of snapshot of source_index at t. You need to run reindexing job again to have latest data of source_index i.e. data at t+1 in your dest_index.



          Ingestions at source_index and ingestions from source_index to destination_index are both independent transactions/processes.



          Reindexing jobs will never always guarantee consistency between source_index and dest_index.




          If a document is modified in the old index, after it has been
          reindexed (mapped to the new index), while the reindexing process is
          working, how would ElasticSearch ensure that this modification is also
          taken account in the new index?




          It won't be taken account in the new index as reindexing would be making use of snapshot of source_index at time t.



          You would need to perform reindexing again. For this general approach would be to have a scheduler which keeps running reindexing process every few hours.



          You can have updates/deletes happening at source_index every few minutes(if you are using scheduler) or in real time(if you are using any event based approach).



          However for full indexing (from source_index to dest_index), have it scheduled like once in a day or twice as it is an expensive process.




          (Similar to 2.) If a record is deleted in the old index, after it has
          been reindexed (mapped to the new index), while the reindexing process
          is working, how would ElasticSearch ensure that this removal is also
          taken account in the new index?




          Again, you need to run a new job/reindexing process.



          Version_type: External



          Just as a side note, one interesting what you can do during reindexing, is to make use of the version_type:external which would ensure only the updated/missing documents from source_index would be reindexed in dest_index



          You can refer to this LINK for more info on this



          POST _reindex
          {
          "source": {
          "index": "source_index"
          },
          "dest": {
          "index": "dest_index",
          "version_type": "external"
          }
          }













          edited Nov 22 at 17:45

























          answered Nov 22 at 16:57









          Kamal









          • 1




            Thanks for the detailed reply! Very nice hint about version_type external, I did not know that! Unfortunately, reindexing a lot of data would take a long time (maybe days), so I'm not sure whether running multiple reindexing tasks from a scheduler is a good idea, as it would slow down the platform for the whole duration. I will wait a day or two for other people to answer as well, but if no one else answers I'll accept your answer.
            – WhileTrueContinue
            Nov 23 at 7:57








          • 1




            Sure. Sorry if my explanation wasn't crystal clear. You should use schedulers only for incremental updates, not for full indexing. What we do is schedule multiple jobs at different times for incremental updates, and we rarely do full indexing (although we still have a job for it). We've got 30-40 different sources that we ingest into various indexes, but we make sure the jobs are scheduled so that no more than two or three incremental jobs run at a time.
            – Kamal
            Nov 23 at 8:23










          • I see, thanks for the reply! Your explanation was clear; incremental updates seem like a good idea, as full reindexing would be very resource-intensive.
            – WhileTrueContinue
            Nov 23 at 8:29






          • 1




            Just to add one more note: if you use a scheduler, incremental updates happen in a pull fashion, i.e. the scheduler is the one extracting/pulling the updates. We are thinking of moving to an event-based approach, but that depends on the source content and which team manages it. If you have control over the source data (unfortunately we don't), I'd suggest implementing a message queue so that any updates/events are carried to Elasticsearch in real time (the source as publisher and Elasticsearch as consumer), which would eliminate the need for schedulers.
            – Kamal
            Nov 23 at 8:35































