Web Scraping: How to get info from dynamic pages?
I'm new to web scraping. I know how to get data from HTML or from JSON, but there is one case I can't figure out. I would like to get the positions of the points and X's that you can see in the shot chart on this page:
http://www.fiba.basketball/euroleaguewomen/18-19/game/2410/Nadezhda-ZVVZ-USK-Praha#|tab=shot_chart
How can I do that?
html json python-3.x web-scraping beautifulsoup
edited Nov 23 '18 at 14:06
asked Nov 23 '18 at 2:19
José Carlos
1 Answer
I'm fairly new as well, but learning as I go. It looks like this page is dynamic, so you'd need to use Selenium to load the page first, then grab the rendered HTML with BeautifulSoup to get the x and y coordinates of the made and missed shots. So I gave it a shot and was able to get a dataframe with the x, y coords along with whether each shot was 'made' or 'missed'.
I plotted it afterwards just to check whether it matched, and it appeared to be flipped about the x-axis. I believe this is because on a chart drawn like this, the top-left corner is (0,0), so y grows downward and the y coordinates come out mirrored when you plot them on ordinary axes. That is why the code multiplies y by -1. I could be wrong though.
Nonetheless, here's the code I used.
import pandas as pd
import bs4
from selenium import webdriver

# Path to your local chromedriver; adjust as needed.
# (Selenium 4+ expects the path via selenium.webdriver.chrome.service.Service instead.)
driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
driver.get('http://www.fiba.basketball/euroleaguewomen/18-19/game/2410/Nadezhda-ZVVZ-USK-Praha#|tab=shot_chart')

# Grab the HTML after the browser has rendered the page
html = driver.page_source
soup = bs4.BeautifulSoup(html, 'html.parser')

made_shots = soup.find_all("svg", {"class": "shot-hit icon icon-point clickable"})
missed_shots = soup.find_all("svg", {"class": "shot-miss icon icon-miss clickable"})

def get_coordinates(elements, label):
    # One row per shot marker: x, y and whether it was made or missed
    rows = [[float(point.get('x')), float(point.get('y')), label] for point in elements]
    return pd.DataFrame(rows, columns=['x', 'y', 'marker'])

made_results = get_coordinates(made_shots, 'made')
missed_results = get_coordinates(missed_shots, 'missed')

results = pd.concat([made_results, missed_results])
results = results.reset_index(drop=True)

# The page's y grows downward, so negate it for plotting
results['y'] = results['y'] * -1

driver.close()
gives this output:
In [6]: results.head(5)
Out[6]:
       x      y marker
0   33.0 -107.0   made
1  159.0 -160.0   made
2  143.0 -197.0   made
3   38.0 -113.0   made
4   65.0 -130.0   made
and when I plot it:
import seaborn as sns
import numpy as np

# Add a color column: green for made shots, red for missed ones
results['color'] = np.where(results['marker'] == 'made', "green", "red")

# Scatter plot without a regression line
sns.regplot(data=results, x="x", y="y", fit_reg=False, scatter_kws={'facecolors': results['color']})
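The y-axis flip described above can be tried out without a browser. This is only a minimal sketch, assuming a coordinate system with the origin at the top-left as guessed in the answer; the function name flip_y and the height parameter are made up for this example and are not part of the page or of any library.

```python
def flip_y(points, height=None):
    """Convert top-left-origin (x, y) pairs to bottom-left-origin pairs.

    With height=None, y is simply negated (as in the code above).
    With a height, y is mirrored inside the chart: y -> height - y.
    """
    flipped = []
    for x, y in points:
        new_y = -y if height is None else height - y
        flipped.append((x, new_y))
    return flipped


# First two rows of the output above, before negation
svg_points = [(33.0, 107.0), (159.0, 160.0)]

print(flip_y(svg_points))              # negated y, as stored in the dataframe
print(flip_y(svg_points, height=300))  # 300 is an assumed chart height
```

Negating y is enough to make the plot look right; mirroring against the chart height only matters if you want the values to stay positive, for example to draw them over a court image.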
ADDITIONAL: I'm sure there's a better, more efficient, cleaner way to code this up, but doing it on the fly, I came up with this. It adds the player, team and play-by-play details for each shot. It should get you going. Feel free to dive into it and look at the HTML source to see how it's grabbing the different data. Have fun.
import pandas as pd
import bs4
from selenium import webdriver

# Path to your local chromedriver; adjust as needed.
# (Selenium 4+ expects the path via selenium.webdriver.chrome.service.Service instead.)
driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
driver.get('http://www.fiba.basketball/euroleaguewomen/18-19/game/2410/Nadezhda-ZVVZ-USK-Praha#|tab=shot_chart')

html = driver.page_source
soup = bs4.BeautifulSoup(html, 'html.parser')

###############################################################################
# Shots: coordinates plus the ids stored in the data-* attributes
shots = soup.find_all("g", {"class": "shot-item"})

rows = []
for point in shots:
    hit = point.get('data-play-by-play-action-hit')
    action_id = point.get('data-play-by-play-action-id')
    period = point.get('data-play-by-play-action-period')
    player_id = point.get('data-play-by-play-action-player-id')
    team_id = point.get('data-play-by-play-action-team-id')
    x_point = float(point.find('svg').get('x'))
    y_point = float(point.find('svg').get('y'))
    rows.append([hit, action_id, period, player_id, team_id, x_point, y_point])

results = pd.DataFrame(rows, columns=['hit', 'action_id', 'period', 'player_id', 'team_id', 'x', 'y'])
results['y'] = results['y'] * -1

###############################################################################
# Players: map player_id to player_name
player_ids = soup.find_all('label', {"class": "item-label"})

rows = []
for player in player_ids:
    player_id = player.find('input').get('data-play-by-play-action-player-id')
    if player_id is None:
        continue
    player_name = player.find('span').text
    rows.append([player_id, player_name])

players = pd.DataFrame(rows, columns=['player_id', 'player_name'])

###############################################################################
# Teams: the id is the last part of the logo URL
team_ids = soup.find_all('div', {"class": "header-scores_desktop"})

teams_A = team_ids[0].find('div', {"class": "team-A"})
team_id_A = teams_A.find('img').get('src').rsplit('/')[-1]
team_name_A = teams_A.find('span').text

teams_B = team_ids[0].find('div', {"class": "team-B"})
team_id_B = teams_B.find('img').get('src').rsplit('/')[-1]
team_name_B = teams_B.find('span').text

teams = pd.DataFrame([[team_id_A, team_name_A], [team_id_B, team_name_B]],
                     columns=['team_id', 'team_name'])

###############################################################################
# Actions: play-by-play details for each shot
action_ids = soup.find_all('div', {"class": "overlay-wrapper"})

rows = []
for action in action_ids:
    action_id = action.get('data-play-by-play-action-id')
    info = action.find('div')
    time_remaining = info.find('span', {'class': 'time'}).text
    full_name = info.find('span', {'class': 'athlete-name'}).text
    # Actions without an action-code span are recorded as '+0'
    action_code = info.find('span', {'class': 'action-code'})
    result_of_action = action_code.text if action_code else '+0'
    action_description = info.find('span', {'class': 'action-description'}).text
    team_A_score = info.find('span', {'class': 'team-A'}).text
    team_B_score = info.find('span', {'class': 'team-B'}).text
    rows.append([action_id, time_remaining, full_name, result_of_action,
                 team_A_score, team_B_score, action_description])

actions = pd.DataFrame(rows, columns=['action_id', 'time_remaining', 'full_name', 'result_of_action',
                                      team_name_A + '_score', team_name_B + '_score', 'action-description'])

###############################################################################
# Attach player, team and action details to each shot
results = pd.merge(results, players, how='left', on='player_id')
results = pd.merge(results, teams, how='left', on='team_id')
results = pd.merge(results, actions, how='left', on='action_id')

driver.close()
And to clean it up a bit, you can sort the rows so that they are in play-by-play order from start to finish:
results.sort_values(['period', 'time_remaining'], ascending=[True, False], inplace=True)
results = results.reset_index(drop=True)
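One caveat with this sort: values scraped from HTML are strings, so they are compared character by character. That works as long as the clock values are zero-padded ('09:45'), but '9:45' would sort after '10:00'. Here is a small sketch of a numeric sort key, assuming time_remaining looks like 'MM:SS' (I haven't verified the exact format on the page); the helper name clock_to_seconds and the sample plays are made up for illustration.

```python
def clock_to_seconds(clock):
    """Turn a 'MM:SS' game-clock string into seconds remaining."""
    minutes, seconds = clock.split(':')
    return int(minutes) * 60 + int(seconds)


# Made-up plays: (period, time_remaining)
plays = [('1', '9:45'), ('2', '10:00'), ('1', '10:00'), ('1', '0:30')]

# Period ascending, then time remaining descending (start of the period first)
ordered = sorted(plays, key=lambda p: (int(p[0]), -clock_to_seconds(p[1])))
print(ordered)
```

With the dataframe, the same helper could feed a numeric column, e.g. results['seconds'] = results['time_remaining'].map(clock_to_seconds), and the sort would then use 'seconds' instead of 'time_remaining'.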
OMG!!! What a wonderful answer and job!!! Thank you so much!!! Do you know if it's possible to find out which player took each shot and the time when it was taken? Thank you, thank you, thank you!!!
– José Carlos
Nov 23 '18 at 13:35
I've got a question... How did you find out that "shot-hit icon icon-point clickable" are the classes to look for? After "soup = bs4.BeautifulSoup(html,'html.parser')", did you print the result and search through it?
– José Carlos
Nov 23 '18 at 13:40
You could do that: just search through the output of print(soup), but it's messy. Sometimes I'll just paste it into Notepad++ and look. But I think it's easier to right-click on the site, choose 'Inspect', and click around in there to see how it's structured and what tags they use. There are a bunch of video tutorials out there. I'll admit it's confusing at first, but practice makes it a bit easier... I'm still learning. It might also be possible to grab the player name and time. I'll look through it now.
– chitown88
Nov 23 '18 at 13:43
Thank you @chitown88 for the help added!!!
– José Carlos
Nov 23 '18 at 19:01
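As a follow-up to the class question in the comments: once you have spotted the class names with 'Inspect', you can try the lookups on a small piece of markup before running them on the whole page. The HTML below is made up and only imitates the structure the answer relies on; the real page may differ. It shows that passing the full class string matches the exact attribute value, while searching for a single class such as shot-hit matches no matter which other classes are present.

```python
import bs4

# Made-up markup that only imitates the structure used in the answer
html = """
<svg>
  <g class="shot-item" data-play-by-play-action-hit="1">
    <svg class="shot-hit icon icon-point clickable" x="33" y="107"></svg>
  </g>
  <g class="shot-item" data-play-by-play-action-hit="0">
    <svg class="shot-miss icon icon-miss clickable" x="50" y="120"></svg>
  </g>
</svg>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

# Exact match on the whole class attribute string
made_exact = soup.find_all('svg', {'class': 'shot-hit icon icon-point clickable'})

# Match on one class only; the other classes on the tag don't matter
made_single = soup.find_all('svg', class_='shot-hit')
missed_single = soup.select('svg.shot-miss')

coords = [(float(tag.get('x')), float(tag.get('y'))) for tag in made_single]
print(len(made_exact), len(made_single), len(missed_single), coords)
```

Matching on a single class is a bit more forgiving if the site ever reorders or adds classes on those tags.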
I'm fairly new as well, but learning as I go. It looks like this page is dynamic, so you'd need to use Selenium to load the page first, before grabbing the html with beautifulsoup to get the x and y coordinates from the Made Shots and Missed shots. So I gave it a shot and was able to get a dataframe with the x, y coords along with if it was 'made' or 'miss'.
I plotted it afterwards just to check to see if it matched, and it appears to be flipped about the x-axis. I believe this is because when you plot on a chart like this graphically, the top, left corner is your (0,0). So your y coordinates are going to be opposite when you want to plot it. I could be wrong though.
None the less, here's the code I used.
import pandas as pd
import bs4
from selenium import webdriver
driver = webdriver.Chrome('C:chromedriver_win32chromedriver.exe')
driver.get('http://www.fiba.basketball/euroleaguewomen/18-19/game/2410/Nadezhda-ZVVZ-USK-Praha#|tab=shot_chart')
html = driver.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
made_shots = soup.findAll("svg", {"class": "shot-hit icon icon-point clickable"})
missed_shots = soup.findAll("svg", {"class": "shot-miss icon icon-miss clickable"})
def get_coordiantes(element, label):
results = pd.DataFrame()
for point in element:
x_point = float(point.get('x'))
y_point = float(point.get('y'))
marker = label
temp_df = pd.DataFrame([[x_point, y_point, marker]], columns=['x','y','marker'])
results = results.append(temp_df)
return results
made_results = get_coordiantes(made_shots, 'made')
missed_results = get_coordiantes(missed_shots, 'missed')
results = made_results.append(missed_results)
results = results.reset_index(drop=True)
results['y'] = results['y'] * -1
driver.close()
gives this output:
In [6]:results.head(5)
Out[6]:
x y marker
0 33.0 -107.0 made
1 159.0 -160.0 made
2 143.0 -197.0 made
3 38.0 -113.0 made
4 65.0 -130.0 made
and when I plot it:
import seaborn as sns
import numpy as np
# Add a column: the color depends of x and y values, but you can use whatever function.
value=(results['marker'] == 'made')
results['color']= np.where( value==True , "green", "red")
# plot
sns.regplot(data=results, x="x", y="y", fit_reg=False, scatter_kws={'facecolors':results['color']})
ADDITIONAL: I'm sure there's a better, more efficient, cleaner way to code this up. But just doing it on the fly, came up with this. It should get you going. Feel free to dive into it and look at the html source code to start seeing how it's grabbing the different data. have fun.
import pandas as pd
import bs4
from selenium import webdriver
driver = webdriver.Chrome('C:chromedriver_win32chromedriver.exe')
driver.get('http://www.fiba.basketball/euroleaguewomen/18-19/game/2410/Nadezhda-ZVVZ-USK-Praha#|tab=shot_chart')
html = driver.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
###############################################################################
shots = soup.findAll("g", {"class": "shot-item"})
results = pd.DataFrame()
for point in shots:
hit = point.get('data-play-by-play-action-hit')
action_id = point.get('data-play-by-play-action-id')
period = point.get('data-play-by-play-action-period')
player_id = point.get('data-play-by-play-action-player-id')
team_id = point.get('data-play-by-play-action-team-id')
x_point = float(point.find('svg').get('x'))
y_point = float(point.find('svg').get('y'))
temp_df = pd.DataFrame([[hit, action_id, period, player_id, team_id, x_point, y_point]],
columns=['hit','action_id','period','player_id','team_id','x','y'])
results = results.append(temp_df)
results['y'] = results['y'] * -1
results = results.reset_index(drop=True)
###############################################################################
player_ids = soup.findAll('label', {"class": "item-label"})
players = pd.DataFrame()
for player in player_ids:
player_id = player.find('input').get('data-play-by-play-action-player-id')
if player_id == None:
continue
player_name = player.find('span').text
temp_df = pd.DataFrame([[player_id, player_name]],
columns=['player_id','player_name'])
players = players.append(temp_df)
players = players.reset_index(drop=True)
###############################################################################
team_ids = soup.findAll('div', {"class": "header-scores_desktop"})
teams_A = team_ids[0].find('div', {"class": "team-A"})
team_id_A = teams_A.find('img').get('src').rsplit('/')[-1]
team_name_A = teams_A.find('span').text
teams_B = team_ids[0].find('div', {"class": "team-B"})
team_id_B = teams_B.find('img').get('src').rsplit('/')[-1]
team_name_B = teams_B.find('span').text
teams = pd.DataFrame([[team_id_A, team_name_A],[team_id_B,team_name_B]],
columns=['team_id','team_name'])
teams = teams.reset_index(drop=True)
###############################################################################
actions = pd.DataFrame()
action_ids = soup.findAll('div', {"class": "overlay-wrapper"})
for action in action_ids:
action_id = action.get('data-play-by-play-action-id')
time_remaining = action.find('div').find('span', {'class': 'time'}).text
full_name = action.find('div').find('span', {'class': 'athlete-name'}).text
if not action.find('div').find('span', {'class': 'action-code'}):
result_of_action = '+0'
else:
result_of_action = action.find('div').find('span', {'class': 'action-code'}).text
action_description = action.find('div').find('span', {'class': 'action-description'}).text
team_A_score = action.find('div').find('span', {'class': 'team-A'}).text
team_B_score = action.find('div').find('span', {'class': 'team-B'}).text
temp_df = pd.DataFrame([[action_id, time_remaining, full_name, result_of_action, team_A_score, team_B_score, action_description]],
columns=['action_id','time_remaining', 'full_name', 'result_of_action', team_name_A+'_score', team_name_B+' score', 'action-description'])
actions = actions.append(temp_df)
actions = actions.reset_index(drop=True)
###############################################################################
results = pd.merge(results, players, how='left', on='player_id')
results = pd.merge(results, teams, how='left', on='team_id')
results = pd.merge(results, actions, how='left', on='action_id')
driver.close()
And to clean it a bit, you can sort the rows so that they are in order, play-by-play from start to finish
results.sort_values(['period', 'time_remaining'], ascending=[True, False], inplace=True)
results = results.reset_index(drop=True)
OMG!!! What a wonderful answer and job!!! Thank you so much!!! Do you know if it's possible to know the player who made the shot and the time when is made it? Thank you, thank you, thank you!!!
– José Carlos
Nov 23 '18 at 13:35
I've got a question ... How can you find that these "shot-hit icon icon-point clickable" are the classes to seek? After "soup = bs4.BeautifulSoup(html,'html.parser')" have you print the code and search in them?
– José Carlos
Nov 23 '18 at 13:40
1
you could do that just search through theprint (soup)
, but it's messy. Sometimes I'll just paste it to notepad++ and look. But it's easier I think to right click on the site, and 'Inspect' and click around in there to see how it's structured and what tags they use. There's a bunch of video tutorials out there. I'll admit, it's confusing at first, but practice with it makes it a bit easier...I'm still learning. It might be possible to also grab the player name and time. I'll look through it now.
– chitown88
Nov 23 '18 at 13:43
Thank you @chitown88 for the help added!!!
– José Carlos
Nov 23 '18 at 19:01
add a comment |
I'm fairly new as well, but learning as I go. It looks like this page is dynamic, so you'd need to use Selenium to load the page first, before grabbing the html with beautifulsoup to get the x and y coordinates from the Made Shots and Missed shots. So I gave it a shot and was able to get a dataframe with the x, y coords along with if it was 'made' or 'miss'.
I plotted it afterwards just to check to see if it matched, and it appears to be flipped about the x-axis. I believe this is because when you plot on a chart like this graphically, the top, left corner is your (0,0). So your y coordinates are going to be opposite when you want to plot it. I could be wrong though.
None the less, here's the code I used.
import pandas as pd
import bs4
from selenium import webdriver
driver = webdriver.Chrome('C:chromedriver_win32chromedriver.exe')
driver.get('http://www.fiba.basketball/euroleaguewomen/18-19/game/2410/Nadezhda-ZVVZ-USK-Praha#|tab=shot_chart')
html = driver.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
made_shots = soup.findAll("svg", {"class": "shot-hit icon icon-point clickable"})
missed_shots = soup.findAll("svg", {"class": "shot-miss icon icon-miss clickable"})
def get_coordiantes(element, label):
results = pd.DataFrame()
for point in element:
x_point = float(point.get('x'))
y_point = float(point.get('y'))
marker = label
temp_df = pd.DataFrame([[x_point, y_point, marker]], columns=['x','y','marker'])
results = results.append(temp_df)
return results
made_results = get_coordiantes(made_shots, 'made')
missed_results = get_coordiantes(missed_shots, 'missed')
results = made_results.append(missed_results)
results = results.reset_index(drop=True)
results['y'] = results['y'] * -1
driver.close()
gives this output:
In [6]:results.head(5)
Out[6]:
x y marker
0 33.0 -107.0 made
1 159.0 -160.0 made
2 143.0 -197.0 made
3 38.0 -113.0 made
4 65.0 -130.0 made
and when I plot it:
import seaborn as sns
import numpy as np
# Add a column: the color depends of x and y values, but you can use whatever function.
value=(results['marker'] == 'made')
results['color']= np.where( value==True , "green", "red")
# plot
sns.regplot(data=results, x="x", y="y", fit_reg=False, scatter_kws={'facecolors':results['color']})
ADDITIONAL: I'm sure there's a better, more efficient, cleaner way to code this up. But just doing it on the fly, came up with this. It should get you going. Feel free to dive into it and look at the html source code to start seeing how it's grabbing the different data. have fun.
import pandas as pd
import bs4
from selenium import webdriver
driver = webdriver.Chrome('C:chromedriver_win32chromedriver.exe')
driver.get('http://www.fiba.basketball/euroleaguewomen/18-19/game/2410/Nadezhda-ZVVZ-USK-Praha#|tab=shot_chart')
html = driver.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
###############################################################################
shots = soup.findAll("g", {"class": "shot-item"})
results = pd.DataFrame()
for point in shots:
hit = point.get('data-play-by-play-action-hit')
action_id = point.get('data-play-by-play-action-id')
period = point.get('data-play-by-play-action-period')
player_id = point.get('data-play-by-play-action-player-id')
team_id = point.get('data-play-by-play-action-team-id')
x_point = float(point.find('svg').get('x'))
y_point = float(point.find('svg').get('y'))
temp_df = pd.DataFrame([[hit, action_id, period, player_id, team_id, x_point, y_point]],
columns=['hit','action_id','period','player_id','team_id','x','y'])
results = results.append(temp_df)
results['y'] = results['y'] * -1
results = results.reset_index(drop=True)
###############################################################################
player_ids = soup.findAll('label', {"class": "item-label"})
players = pd.DataFrame()
for player in player_ids:
player_id = player.find('input').get('data-play-by-play-action-player-id')
if player_id == None:
continue
player_name = player.find('span').text
temp_df = pd.DataFrame([[player_id, player_name]],
columns=['player_id','player_name'])
players = players.append(temp_df)
players = players.reset_index(drop=True)
###############################################################################
team_ids = soup.findAll('div', {"class": "header-scores_desktop"})
teams_A = team_ids[0].find('div', {"class": "team-A"})
team_id_A = teams_A.find('img').get('src').rsplit('/')[-1]
team_name_A = teams_A.find('span').text
teams_B = team_ids[0].find('div', {"class": "team-B"})
team_id_B = teams_B.find('img').get('src').rsplit('/')[-1]
team_name_B = teams_B.find('span').text
teams = pd.DataFrame([[team_id_A, team_name_A],[team_id_B,team_name_B]],
columns=['team_id','team_name'])
teams = teams.reset_index(drop=True)
###############################################################################
actions = pd.DataFrame()
action_ids = soup.findAll('div', {"class": "overlay-wrapper"})
for action in action_ids:
action_id = action.get('data-play-by-play-action-id')
time_remaining = action.find('div').find('span', {'class': 'time'}).text
full_name = action.find('div').find('span', {'class': 'athlete-name'}).text
if not action.find('div').find('span', {'class': 'action-code'}):
result_of_action = '+0'
else:
result_of_action = action.find('div').find('span', {'class': 'action-code'}).text
action_description = action.find('div').find('span', {'class': 'action-description'}).text
team_A_score = action.find('div').find('span', {'class': 'team-A'}).text
team_B_score = action.find('div').find('span', {'class': 'team-B'}).text
temp_df = pd.DataFrame([[action_id, time_remaining, full_name, result_of_action, team_A_score, team_B_score, action_description]],
columns=['action_id','time_remaining', 'full_name', 'result_of_action', team_name_A+'_score', team_name_B+' score', 'action-description'])
actions = actions.append(temp_df)
actions = actions.reset_index(drop=True)
###############################################################################
results = pd.merge(results, players, how='left', on='player_id')
results = pd.merge(results, teams, how='left', on='team_id')
results = pd.merge(results, actions, how='left', on='action_id')
driver.close()
And to clean it a bit, you can sort the rows so that they are in order, play-by-play from start to finish
results.sort_values(['period', 'time_remaining'], ascending=[True, False], inplace=True)
results = results.reset_index(drop=True)
OMG!!! What a wonderful answer and job!!! Thank you so much!!! Do you know if it's possible to know the player who made the shot and the time when is made it? Thank you, thank you, thank you!!!
– José Carlos
Nov 23 '18 at 13:35
I've got a question ... How can you find that these "shot-hit icon icon-point clickable" are the classes to seek? After "soup = bs4.BeautifulSoup(html,'html.parser')" have you print the code and search in them?
– José Carlos
Nov 23 '18 at 13:40
1
you could do that just search through theprint (soup)
, but it's messy. Sometimes I'll just paste it to notepad++ and look. But it's easier I think to right click on the site, and 'Inspect' and click around in there to see how it's structured and what tags they use. There's a bunch of video tutorials out there. I'll admit, it's confusing at first, but practice with it makes it a bit easier...I'm still learning. It might be possible to also grab the player name and time. I'll look through it now.
– chitown88
Nov 23 '18 at 13:43
Thank you @chitown88 for the help added!!!
– José Carlos
Nov 23 '18 at 19:01
add a comment |
I'm fairly new as well, but learning as I go. It looks like this page is dynamic, so you'd need to use Selenium to load the page first, before grabbing the html with beautifulsoup to get the x and y coordinates from the Made Shots and Missed shots. So I gave it a shot and was able to get a dataframe with the x, y coords along with if it was 'made' or 'miss'.
I plotted it afterwards just to check to see if it matched, and it appears to be flipped about the x-axis. I believe this is because when you plot on a chart like this graphically, the top, left corner is your (0,0). So your y coordinates are going to be opposite when you want to plot it. I could be wrong though.
None the less, here's the code I used.
import pandas as pd
import bs4
from selenium import webdriver
driver = webdriver.Chrome('C:chromedriver_win32chromedriver.exe')
driver.get('http://www.fiba.basketball/euroleaguewomen/18-19/game/2410/Nadezhda-ZVVZ-USK-Praha#|tab=shot_chart')
html = driver.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
made_shots = soup.findAll("svg", {"class": "shot-hit icon icon-point clickable"})
missed_shots = soup.findAll("svg", {"class": "shot-miss icon icon-miss clickable"})
def get_coordiantes(element, label):
results = pd.DataFrame()
for point in element:
x_point = float(point.get('x'))
y_point = float(point.get('y'))
marker = label
temp_df = pd.DataFrame([[x_point, y_point, marker]], columns=['x','y','marker'])
results = results.append(temp_df)
return results
made_results = get_coordiantes(made_shots, 'made')
missed_results = get_coordiantes(missed_shots, 'missed')
results = made_results.append(missed_results)
results = results.reset_index(drop=True)
results['y'] = results['y'] * -1
driver.close()
gives this output:
In [6]:results.head(5)
Out[6]:
x y marker
0 33.0 -107.0 made
1 159.0 -160.0 made
2 143.0 -197.0 made
3 38.0 -113.0 made
4 65.0 -130.0 made
and when I plot it:
import seaborn as sns
import numpy as np
# Add a column: the color depends of x and y values, but you can use whatever function.
value=(results['marker'] == 'made')
results['color']= np.where( value==True , "green", "red")
# plot
sns.regplot(data=results, x="x", y="y", fit_reg=False, scatter_kws={'facecolors':results['color']})
ADDITIONAL: I'm sure there's a better, more efficient, cleaner way to code this up. But just doing it on the fly, came up with this. It should get you going. Feel free to dive into it and look at the html source code to start seeing how it's grabbing the different data. have fun.
import pandas as pd
import bs4
from selenium import webdriver
driver = webdriver.Chrome('C:chromedriver_win32chromedriver.exe')
driver.get('http://www.fiba.basketball/euroleaguewomen/18-19/game/2410/Nadezhda-ZVVZ-USK-Praha#|tab=shot_chart')
html = driver.page_source
soup = bs4.BeautifulSoup(html,'html.parser')
###############################################################################
shots = soup.findAll("g", {"class": "shot-item"})
results = pd.DataFrame()
for point in shots:
hit = point.get('data-play-by-play-action-hit')
action_id = point.get('data-play-by-play-action-id')
period = point.get('data-play-by-play-action-period')
player_id = point.get('data-play-by-play-action-player-id')
team_id = point.get('data-play-by-play-action-team-id')
x_point = float(point.find('svg').get('x'))
y_point = float(point.find('svg').get('y'))
temp_df = pd.DataFrame([[hit, action_id, period, player_id, team_id, x_point, y_point]],
columns=['hit','action_id','period','player_id','team_id','x','y'])
results = results.append(temp_df)
results['y'] = results['y'] * -1
results = results.reset_index(drop=True)
###############################################################################
player_ids = soup.findAll('label', {"class": "item-label"})
players = pd.DataFrame()
for player in player_ids:
player_id = player.find('input').get('data-play-by-play-action-player-id')
if player_id == None:
continue
player_name = player.find('span').text
temp_df = pd.DataFrame([[player_id, player_name]],
columns=['player_id','player_name'])
players = players.append(temp_df)
players = players.reset_index(drop=True)
###############################################################################
team_ids = soup.findAll('div', {"class": "header-scores_desktop"})
teams_A = team_ids[0].find('div', {"class": "team-A"})
team_id_A = teams_A.find('img').get('src').rsplit('/')[-1]
team_name_A = teams_A.find('span').text
teams_B = team_ids[0].find('div', {"class": "team-B"})
team_id_B = teams_B.find('img').get('src').rsplit('/')[-1]
team_name_B = teams_B.find('span').text
teams = pd.DataFrame([[team_id_A, team_name_A],[team_id_B,team_name_B]],
columns=['team_id','team_name'])
teams = teams.reset_index(drop=True)
###############################################################################
# One row per play-by-play action: clock, player, result and running score
action_rows = []
for action in soup.find_all('div', {"class": "overlay-wrapper"}):
    info = action.find('div')
    code = info.find('span', {'class': 'action-code'})
    action_rows.append({
        'action_id': action.get('data-play-by-play-action-id'),
        'time_remaining': info.find('span', {'class': 'time'}).text,
        'full_name': info.find('span', {'class': 'athlete-name'}).text,
        # no action code on the row: record it as '+0'
        'result_of_action': code.text if code is not None else '+0',
        team_name_A + '_score': info.find('span', {'class': 'team-A'}).text,
        team_name_B + '_score': info.find('span', {'class': 'team-B'}).text,
        'action-description': info.find('span', {'class': 'action-description'}).text,
    })
# Build the frame once; DataFrame.append was removed in pandas 2.0
actions = pd.DataFrame(action_rows)
###############################################################################
# Attach player names, team names and play-by-play details to each shot
results = pd.merge(results, players, how='left', on='player_id')
results = pd.merge(results, teams, how='left', on='team_id')
results = pd.merge(results, actions, how='left', on='action_id')
driver.quit()  # end the browser session, not just the current window
And to clean it up a bit, you can sort the rows so that they are in play-by-play order, from start to finish:
results = results.sort_values(['period', 'time_remaining'], ascending=[True, False])
results = results.reset_index(drop=True)
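One thing to watch with the sort above: time_remaining is scraped as text, so it is compared character by character. Here is a minimal sketch of sorting on a numeric clock instead. It assumes the clock looks like 'MM:SS' and that periods look like 'Q1'; neither format is confirmed by the page, and clock_to_seconds, seconds_left and the demo rows are made up for illustration.

```python
import pandas as pd


def clock_to_seconds(clock: str) -> int:
    """Convert an 'MM:SS' game-clock string (assumed format) to seconds."""
    minutes, seconds = clock.strip().split(":")
    return int(minutes) * 60 + int(seconds)


# Made-up rows standing in for the scraped results
demo = pd.DataFrame({
    "period": ["Q1", "Q1", "Q2", "Q1"],
    "time_remaining": ["9:45", "10:00", "08:12", "03:30"],
})

# Sort on the numeric clock, so '9:45' no longer lands ahead of '10:00'
demo["seconds_left"] = demo["time_remaining"].map(clock_to_seconds)
demo = demo.sort_values(["period", "seconds_left"], ascending=[True, False])
demo = demo.reset_index(drop=True)
print(demo)
```

If the site always zero-pads the clock, the plain string sort gives the same order and this extra column isn't needed.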
edited Nov 24 '18 at 11:45
answered Nov 23 '18 at 12:41
chitown88
OMG!!! What a wonderful answer and job!!! Thank you so much!!! Do you know if it's possible to know the player who made the shot and the time when it was made? Thank you, thank you, thank you!!!
– José Carlos
Nov 23 '18 at 13:35
I've got a question ... How did you find out that "shot-hit icon icon-point clickable" are the classes to look for? After "soup = bs4.BeautifulSoup(html,'html.parser')", did you print the code and search through it?
– José Carlos
Nov 23 '18 at 13:40
You could do that: just search through the print(soup) output, but it's messy. Sometimes I'll just paste it into Notepad++ and look. But I think it's easier to right-click on the site, choose 'Inspect', and click around in there to see how it's structured and what tags they use. There's a bunch of video tutorials out there. I'll admit, it's confusing at first, but practice makes it a bit easier... I'm still learning. It might also be possible to grab the player name and time. I'll look through it now.
– chitown88
Nov 23 '18 at 13:43
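As a complement to clicking around in 'Inspect', here is a rough sketch of letting the parsed page tell you which class combinations exist, by tallying the class attribute of every svg tag. The sample_html string is made-up markup that only imitates the shot chart; on the real page you would build soup from driver.page_source instead.

```python
from collections import Counter

import bs4

# Made-up markup imitating the shot chart; the real page is far larger
sample_html = """
<svg class="court">
  <g class="shot-item"><svg class="shot-hit icon icon-point clickable" x="33" y="107"></svg></g>
  <g class="shot-item"><svg class="shot-miss icon icon-miss clickable" x="50" y="80"></svg></g>
  <g class="shot-item"><svg class="shot-hit icon icon-point clickable" x="159" y="160"></svg></g>
</svg>
"""
soup = bs4.BeautifulSoup(sample_html, "html.parser")

# bs4 returns the class attribute as a list, so join it back into one string
class_counts = Counter(
    " ".join(tag.get("class", []))
    for tag in soup.find_all("svg")
)

# Most frequent class combinations first
for classes, count in class_counts.most_common():
    print(count, classes)
```

Class strings that repeat many times (one per shot) are usually the ones worth targeting; swapping "svg" for "g" or "div" explores other parts of the page the same way.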
Thank you @chitown88 for the help added!!!
– José Carlos
Nov 23 '18 at 19:01