Elasticsearch Reindexing race condition
Hello Elasticsearch users/experts,
I have some trouble understanding the race condition problem with the Reindex API of Elasticsearch, and I would like to hear whether anyone has found a solution for it.
I have searched in many places and could not find a clear solution (most of the solutions date back to before the Reindex API existed).
As you may know, the (now) standard way of reindexing documents (after changing the mapping, for example) is to use an alias.
Suppose the alias points to "old_index". We create a new index called "new_index" with the new mapping, call the Reindex API to copy the documents from "old_index" to "new_index", and then switch the alias to point to "new_index" (removing the alias pointer to "old_index"). This seems to be the standard way of reindexing, and it is what almost every recent site I visited describes.
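For concreteness, here is roughly how I understand the final alias switch would look (the alias name "my_alias" is just a placeholder); the _aliases endpoint applies both actions atomically:
# Atomically move the alias from the old index to the new one.
# "my_alias" is a placeholder alias name.
POST _aliases
{
  "actions": [
    { "remove": { "index": "old_index", "alias": "my_alias" } },
    { "add": { "index": "new_index", "alias": "my_alias" } }
  ]
}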
My questions are the following. Using this method, I do not want any downtime (users should still be able to search documents), and I still want to be able to ingest documents into Elasticsearch while the reindexing process is running:
1. If documents keep arriving while the reindexing process is running (which would probably take a lot of time), how does the reindexing process ensure that such a document is ingested into the old index (so that it can be searched while reindexing is in progress) but is still correctly reindexed into the new index?
2. If a document in the old index is modified after it has already been reindexed (copied to the new index), while the reindexing process is still running, how does Elasticsearch ensure that this modification is also taken into account in the new index?
3. (Similar to 2.) If a record in the old index is deleted after it has already been reindexed (copied to the new index), while the reindexing process is still running, how does Elasticsearch ensure that this removal is also taken into account in the new index?
Basically, in a scenario where no indexing mistake is affordable for any document, how would one proceed to make sure the reindexing happens without any of the above problems?
Does anyone have an idea? And if there is no solution without downtime, how would we proceed with the least amount of downtime?
Thanks in advance!
elasticsearch kibana
asked Nov 22 at 16:10
WhileTrueContinue
458
1 Answer
accepted · 2 votes
Apologies if it's too verbose, but my two cents:
If documents keep arriving while the reindexing process is running (which would probably take a lot of time), how does the reindexing process ensure that such a document is ingested into the old index (so that it can be searched while reindexing is in progress) but is still correctly reindexed into the new index?
While a reindex is running from source to destination, the alias would (and must) still point to source_index. All modifications to this index happen independently, and these updates/deletes take effect there immediately.
Let's say the state of source_index changes from t to t+1. If you ran a reindexing job at t to dest_index, it would still consume the data from the snapshot of source_index taken at t. You need to run the reindexing job again to get the latest data of source_index, i.e. the data at t+1, into your dest_index.
Ingestion into source_index and ingestion from source_index into dest_index are two independent processes. A reindexing job will never by itself guarantee consistency between source_index and dest_index.
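One rough way to observe this drift is to compare document counts after a reindex run finishes (only a sanity check, since counts can match while individual documents differ):
# Sanity check: compare document counts of both indices after a run.
GET source_index/_count
GET dest_index/_count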
If a document in the old index is modified after it has already been reindexed (copied to the new index), while the reindexing process is still running, how does Elasticsearch ensure that this modification is also taken into account in the new index?
It won't be taken into account in the new index, because the reindexing works from the snapshot of source_index taken at time t.
You would need to run the reindexing again. The general approach for this is a scheduler that re-runs the reindexing process every few hours.
You can propagate the updates/deletes happening at source_index every few minutes (if you are using a scheduler) or in real time (if you are using an event-based approach).
However, schedule a full reindex (from source_index to dest_index) only once or twice a day, as it is an expensive process. A sketch of an incremental run is shown below.
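To illustrate what such an incremental job could look like: assuming your documents carry a timestamp field such as last_updated (a hypothetical field name, maintained by your ingest process), a job scheduled every hour could reindex only the new and updated documents since the previous run:
# Incremental reindex: copy only documents modified in the last hour.
# "last_updated" is an assumed timestamp field in the source documents.
POST _reindex
{
  "source": {
    "index": "source_index",
    "query": {
      "range": {
        "last_updated": { "gte": "now-1h" }
      }
    }
  },
  "dest": {
    "index": "dest_index",
    "version_type": "external"
  }
}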
(Similar to 2.) If a record in the old index is deleted after it has already been reindexed (copied to the new index), while the reindexing process is still running, how does Elasticsearch ensure that this removal is also taken into account in the new index?
Again, you need to run a new job/reindexing process. (Note that a plain reindex only copies documents into the destination, so to propagate deletions you would typically rebuild the destination index from scratch, or remove the deleted documents from it separately.)
version_type: external
Just as a side note, one interesting thing you can do during reindexing is to make use of version_type: external, which ensures that only the updated or missing documents from source_index are reindexed into dest_index.
You can refer to this LINK for more info on this:
POST _reindex
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index",
    "version_type": "external"
  }
}
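With external versioning, a document is copied only if its version in source_index is higher than the version of the corresponding document in dest_index (a missing document counts as lower), so unchanged documents are skipped instead of overwritten. By default the reindex aborts on version conflicts, so you would typically also set "conflicts": "proceed" in the request body to skip and count them instead.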
edited Nov 22 at 17:45
answered Nov 22 at 16:57
Kamal
Thanks for the detailed reply! Very nice hint about version_type: external, I did not know that! Unfortunately, reindexing a lot of data would take a long time (maybe days), so I'm not sure a scheduler is a good idea for running multiple reindexing tasks, as it would slow down the platform for the whole duration? I will wait a day or two for other people to answer as well, but if no one else answers I'll accept your answer. – WhileTrueContinue, Nov 23 at 7:57
Sure. Sorry if my explanation wasn't crystal clear: you should use schedulers only for incremental updates, not for full reindexing. What we do is have multiple jobs scheduled at different times for incremental updates, but we rarely do a full reindex (though we still have a job for it). We have 30-40 different sources ingesting into various indexes, but we make sure the jobs are scheduled so that no more than two or three incremental jobs run at a time. – Kamal, Nov 23 at 8:23
I see, thanks for the reply! Your explanation was clear; incremental updates seem like a good idea, as a full reindex would be very resource-consuming. – WhileTrueContinue, Nov 23 at 8:29
Just to add one more note: if you use a scheduler, incremental updates happen in a pull fashion, i.e. the scheduler is the one extracting/pulling the updates. We are considering moving to an event-based approach, but that depends on the source content and which team manages it. If you have control over the source data (unfortunately we don't), I'd suggest implementing a messaging queue so that updates/events are carried to Elasticsearch in real time (the source as publisher, Elasticsearch as consumer), which would eliminate the need for schedulers. – Kamal, Nov 23 at 8:35