[FIXED] When scraping text with BeautifulSoup and Selenium, I am not scraping from the second tab (within the page)

Issue

I am trying to scrape play-by-play score data for the NBA from ESPN. The data is split by quarter, so I use Selenium to switch between the tabs/quarters, but it always scrapes the same data (that of the first quarter). Code:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.ui import WebDriverWait
import requests

url = "https://www.espn.com/nba/playbyplay/_/gameId/401474876"

browser = webdriver.Chrome()
browser.get(url)

html_text = requests.get(url).text
soup = BeautifulSoup(html_text, "lxml")

minutes = list()
a_score = list()
b_score = list()

a_score_raw = soup.find_all("td", class_="playByPlay__score playByPlay__score--away tr Table__TD")
b_score_raw = soup.find_all("td", class_="playByPlay__score playByPlay__score--home tr Table__TD")
minutes_raw = soup.find_all("td", class_="playByPlay__time Table__TD")

for i in range(len(a_score_raw)):
    a_score.append(a_score_raw[i].text)
    b_score.append(b_score_raw[i].text)
    minutes.append(minutes_raw[i].text)

q2_xpath = "/html/body/div[1]/div/div/div/main/div[2]/div/div[5]/div/div/section[2]/div/nav/ul/li[2]/button"
WebDriverWait(browser, 10).until(expected_conditions.visibility_of_element_located((By.XPATH, q2_xpath)))
browser.find_element(By.XPATH, value=q2_xpath).click()

a_score_raw = soup.find_all("td", class_="playByPlay__score playByPlay__score--away tr Table__TD")
b_score_raw = soup.find_all("td", class_="playByPlay__score playByPlay__score--home tr Table__TD")
minutes_raw = soup.find_all("td", class_="playByPlay__time Table__TD")

for i in range(len(a_score_raw)):
    a_score.append(a_score_raw[i].text)
    b_score.append(b_score_raw[i].text)
    minutes.append(minutes_raw[i].text)

q3_xpath = "/html/body/div[1]/div/div/div/main/div[2]/div/div[5]/div/div/section[2]/div/nav/ul/li[4]/button"
WebDriverWait(browser, 10).until(expected_conditions.visibility_of_element_located((By.XPATH, q3_xpath)))
browser.find_element(By.XPATH, value=q3_xpath).click()

a_score_raw = soup.find_all("td", class_="playByPlay__score playByPlay__score--away tr Table__TD")
b_score_raw = soup.find_all("td", class_="playByPlay__score playByPlay__score--home tr Table__TD")
minutes_raw = soup.find_all("td", class_="playByPlay__time Table__TD")

for i in range(len(a_score_raw)):
    a_score.append(a_score_raw[i].text)
    b_score.append(b_score_raw[i].text)
    minutes.append(minutes_raw[i].text)

q4_xpath = "/html/body/div[1]/div/div/div/main/div[2]/div/div[5]/div/div/section[2]/div/nav/ul/li[4]/button"
WebDriverWait(browser, 10).until(expected_conditions.visibility_of_element_located((By.XPATH, q4_xpath)))
browser.find_element(By.XPATH, value=q4_xpath).click()

a_score_raw = soup.find_all("td", class_="playByPlay__score playByPlay__score--away tr Table__TD")
b_score_raw = soup.find_all("td", class_="playByPlay__score playByPlay__score--home tr Table__TD")
minutes_raw = soup.find_all("td", class_="playByPlay__time Table__TD")

for i in range(len(a_score_raw)):
    a_score.append(a_score_raw[i].text)
    b_score.append(b_score_raw[i].text)
    minutes.append(minutes_raw[i].text)

The data for all quarters has the same tag and class, and Selenium switches tabs properly, so that should not be the problem. What could the issue be? How can I switch between tabs within a website for scraping? Feel free to point out a problem or suggest alternative solutions (or even free data sources for game-flow data).

Solution

The data is embedded in a <script> tag on this page. To decode it, you can use the following script:

import re
import json
import requests


url = "https://www.espn.com/nba/playbyplay/_/gameId/401474876"

data = requests.get(url).text
data = re.search(r"window\['__espnfitt__'\]=(\{.*\});", data).group(1)
data = json.loads(data)

for g in data["page"]["content"]["gamepackage"]["pbp"]["playGrps"]:
    for d in g:
        print(
            "{:<10} {:<5} {:<5} {}".format(
                d["clock"]["displayValue"],
                d["homeScore"],
                d["awayScore"],
                d["text"],
            )
        )

Prints:

...

2:05       100   111   Sandro Mamukelashvili turnover
1:51       100   111   Lindell Wigginton personal foul
1:42       100   113   Armoni Brooks makes 2-foot dunk (Chris Silva assists)
1:29       103   113   Lindell Wigginton makes 27-foot step back jumpshot
1:18       103   113   MarJon Beauchamp personal foul
1:18       103   114   Chris Silva makes free throw 1 of 2
1:18       103   115   Chris Silva makes free throw 2 of 2
1:03       103   115   Lindell Wigginton offensive foul
1:03       103   115   Lindell Wigginton turnover
50.5       103   115   Chris Silva bad pass (Lindell Wigginton steals)
45.8       106   115   AJ Green makes 28-foot three point shot (MarJon Beauchamp assists)
20.9       106   118   Vit Krejci makes 29-foot three point shot
5.9        109   118   AJ Green makes 28-foot three point jumper
0.0        109   118   End of the 4th Quarter
0.0        109   118   End of Game
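The key step above is the regular expression, which captures the JSON object literal assigned to the global window['__espnfitt__'] inside a <script> tag. A minimal illustration on a made-up page fragment (the keys used here are invented purely for the demo):

```python
import re
import json

# Synthetic page fragment mimicking how ESPN embeds its state object
# (the real payload is much larger; this is just for illustration).
html = "<script>window['__espnfitt__']={\"page\":{\"content\":{\"score\":42}}};</script>"

# Same pattern as above: capture the object literal assigned to the global.
match = re.search(r"window\['__espnfitt__'\]=(\{.*\});", html)
data = json.loads(match.group(1))

print(data["page"]["content"]["score"])  # -> 42
```

The greedy `.*` runs to the last `};` in the tag, so the whole object is captured in one group and handed to json.loads.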

EDIT: You can also parse the data with the js2py module:

import js2py
import requests
from bs4 import BeautifulSoup


url = "https://www.espn.com/nba/playbyplay/_/gameId/401474876"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

x = js2py.eval_js(soup.select("script")[2].text)
print(x.to_dict()["page"]["content"]["gamepackage"]["pbp"]["playGrps"])
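Since the question collects times and scores into lists for pandas, the decoded playGrps structure (one list of plays per quarter) can also be flattened straight into a DataFrame. A sketch on made-up sample data, assuming only the field names used in the print loop above:

```python
import pandas as pd

# Sample data shaped like data["page"]["content"]["gamepackage"]["pbp"]["playGrps"]:
# a list per quarter, each play carrying a clock, both scores, and a description.
play_grps = [
    [  # quarter 1
        {"clock": {"displayValue": "12:00"}, "homeScore": 0, "awayScore": 0, "text": "Jump ball"},
        {"clock": {"displayValue": "11:42"}, "homeScore": 2, "awayScore": 0, "text": "Made layup"},
    ],
    [  # quarter 2
        {"clock": {"displayValue": "12:00"}, "homeScore": 30, "awayScore": 28, "text": "Start of 2nd"},
    ],
]

# Flatten quarters into one row per play, tagging each row with its quarter.
rows = [
    {
        "quarter": q + 1,
        "clock": play["clock"]["displayValue"],
        "home": play["homeScore"],
        "away": play["awayScore"],
        "play": play["text"],
    }
    for q, grp in enumerate(play_grps)
    for play in grp
]

df = pd.DataFrame(rows)
print(df)
```

The quarter column comes for free from the nesting, which the tab-clicking approach had to reconstruct manually.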

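As for why the Selenium version in the question keeps returning first-quarter rows: soup is built once from requests.get(url).text and never rebuilt, so the clicks in the browser cannot affect it. Re-parsing browser.page_source after each click should fix that. A sketch of the extraction step (untested against the live page; the class names are copied from the question and the sample HTML is invented):

```python
from bs4 import BeautifulSoup

def extract_rows(html):
    """Pull (time, away_score, home_score) tuples from one play-by-play snapshot."""
    soup = BeautifulSoup(html, "html.parser")
    away = soup.find_all("td", class_="playByPlay__score playByPlay__score--away tr Table__TD")
    home = soup.find_all("td", class_="playByPlay__score playByPlay__score--home tr Table__TD")
    times = soup.find_all("td", class_="playByPlay__time Table__TD")
    return [(t.text, a.text, h.text) for t, a, h in zip(times, away, home)]

# Tiny static snapshot standing in for browser.page_source:
sample = (
    '<table><tr>'
    '<td class="playByPlay__time Table__TD">11:42</td>'
    '<td class="playByPlay__score playByPlay__score--away tr Table__TD">2</td>'
    '<td class="playByPlay__score playByPlay__score--home tr Table__TD">0</td>'
    '</tr></table>'
)
print(extract_rows(sample))  # -> [('11:42', '2', '0')]
```

In the original loop one would then call extract_rows(browser.page_source) after each WebDriverWait/click, instead of reusing the stale soup built from requests.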

Answered by –
Andrej Kesely


Answer checked by –
Willingham (FixError Volunteer)
