Python (Selenium/BeautifulSoup) Search Result Dynamic URL
Disclaimer: This is my first foray into web scraping
I have a list of URLs corresponding to search results, e.g.,
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
I'm trying to use Selenium to access the HTML of the result as follows:
for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662
https://www.vinelink.com/#/searchResults/1
Does anyone have a tip on how to access the final search results data?
Update: After further exploration, this seems like it might have to do with the scripts being executed to retrieve the relevant data for display. There are many search-results-related scripts referenced in the page_source; is there a way to determine which one is relevant?
I am able to Inspect the information I need in the browser's developer tools.
python selenium selenium-webdriver beautifulsoup
What are the components that you are trying to access but finding unavailable? Selenium should load all JavaScript before returning the HTML object to be parsed by BeautifulSoup.
– Joseph Choi
Nov 23 at 4:30
Hi Joseph, I'm trying to access <search-result> tags from the final destination page. Per my question, if I enter one of the original URLs into my Chrome address bar, the page loads sequentially and I see the URL change twice until it lands on '/#/searchResults/1' (the same URL no matter which offender is searched). Any idea how to ensure Selenium does not pull data from the first URL in the series of redirects?
– OJT
Nov 23 at 5:32
When I try to connect to the link provided, I get redirected to an unauthorized page (vinelink.com/#/unauthorized). From my experience and testing, lines after driver.get(url) are only executed after the browser has finished loading; Selenium is designed to emulate the web browsing experience the same as a human would. Can you confirm that the HTML you receive from driver.page_source is different from what you get when browsing yourself?
– Joseph Choi
Nov 23 at 6:33
Try calling soup.find_all("search-result") to confirm that you are not getting the data you need
– Joseph Choi
Nov 23 at 6:35
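For reference, that check is a two-liner against the soup built in the question's loop (soup here is carried over from that code as an assumption):

results = soup.find_all('search-result')
print(len(results))  # 0 means the tag never made it into the HTML Selenium returned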
Hi Joseph, based on what you've written, I found that I could actually bypass the driver.page_source call, and instead insert driver.implicitly_wait(5) before beginning to scrape data (this allows sufficient time for the browsing emulation to reach the destination page). Thank you very much! I now have a new problem (reCAPTCHA prevents me from collecting data from more than a few of the URLs in my list), but I will create a separate question for this!
– OJT
Nov 23 at 7:12
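A minimal sketch of that fix, assuming Chrome and the detail_urls list from the question. One caveat worth hedging: implicitly_wait(5) only affects element lookups, so it is the subsequent find_element call that actually pauses the script until the redirect chain finishes rendering.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.implicitly_wait(5)  # element lookups now poll for up to 5 seconds

for url in detail_urls:  # assumed list of search URLs, as in the question
    driver.get(url)
    # This lookup blocks (up to 5 s) until the final page renders the tag,
    # after which page_source reflects the destination of the redirects.
    driver.find_element(By.TAG_NAME, 'search-result')
    html = driver.page_source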
asked Nov 22 at 17:53 – OJT
edited Nov 22 at 19:45 – JaSON
1 Answer
Once you have your soup variable with the HTML, follow the code below.
import json

data = soup.find('search-result')['data']  # the JSON payload is stored in the tag's data attribute
print(data)
Output:
{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}
Now treat each value like a dict.
Next:
info = json.loads(data)
print(info['first_name'], info['last_name'])
# This prints the first and last name; other fields work the same way via keys like 'date_of_birth' or 'siteId'. You can also assign them to variables.

answered Nov 23 at 4:18 – Kamikaze_goldfish
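Pulling the answer and the comment thread together, one plausible end-to-end loop. Hedged: detail_urls and the 10-second timeout are assumptions, and WebDriverWait is used here as a more targeted alternative to the implicit wait discussed in the comments above.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import json

driver = webdriver.Chrome()
records = []
for url in detail_urls:  # the question's list of search-result URLs
    driver.get(url)
    # Wait explicitly for the redirected page to render the target tag
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'search-result')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    tag = soup.find('search-result')
    if tag is None:  # guards against the NoneType error reported in the comments below
        continue
    info = json.loads(tag['data'])
    records.append((info['first_name'], info['last_name']))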
Thanks for the suggestion! I think that, because of the dynamic URL creation process for the search results, I am not winding up with the correct HTML for the eventual destination, so the soup variable does not include the correct HTML and I get the following error: TypeError: 'NoneType' object is not subscriptable. Is there some way to get Selenium to walk through the URL change process and pull the correct page?
– OJT
Nov 23 at 5:30
So when you run your code, what output do you get?
– Kamikaze_goldfish
Nov 23 at 17:34
I have actually now solved things per Joseph Choi's suggestion in the comments; I merely inserted a driver.implicitly_wait(5) after loading the original URL, and then the further Selenium commands do not begin until the final destination of the redirect has been reached.
– OJT
Nov 23 at 17:41