[FIXED] Web scraping: headers and sub-headers

Issue

I want to scrape https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal) with bs4.

The code I have only scrapes the headers:

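# gdp is assumed to hold the <table> elements found earlier with bs4 (not shown in the question)
table1 = gdp[0]
body = table1.find_all("tr")
head = body[0]  # first row only, so sub-headers in the second row are missed
headings = []
for item in head.find_all("th"):
    item = item.text.rstrip("\n")
    headings.append(item)

df = pd.DataFrame(columns=headings)
df.head()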


I need help scraping both the headers and the sub-headers ([screenshot][1]). The expected result is a pandas DataFrame with those headers as its columns.


  [1]: https://i.stack.imgur.com/mBWOm.png

Solution

Use read_html and select the third table; header=[0, 1] reads the two header rows as a MultiIndex. The next step is to flatten the columns in a list comprehension: remove everything from the first [ onward in the top level, and join both levels if they differ:

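import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
# header=[0, 1] builds a two-level column index; [2] picks the third table on the page
df = pd.read_html(url, header=[0, 1])[2]

# Flatten: strip footnote markers such as "[1]" from the top level, join levels if they differ
df.columns = [(f'{a.split("[")[0]} {b}') if a!=b else a for a, b in df.columns]
print(df)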
    Country/Territory UN Region  ... World Bank Estimate World Bank Year
0               World         —  ...            96100091            2021
1       United States  Americas  ...            22996100            2021
2               China      Asia  ...            17734063            2021
3               Japan      Asia  ...             4937422            2021
4             Germany    Europe  ...             4223116            2021
..                ...       ...  ...                 ...             ...
212             Palau   Oceania  ...                 258            2020
213          Kiribati   Oceania  ...                 181            2020
214             Nauru   Oceania  ...                 133            2021
215        Montserrat  Americas  ...                   —               —
216            Tuvalu   Oceania  ...                  63            2021

[217 rows x 8 columns]
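
The flattening step can be tried without network access. The following is a minimal sketch, assuming a hand-built MultiIndex that mimics what read_html returns with header=[0, 1]; the column labels, the sample row, and the helper name flatten are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical two-level header, shaped like the result of read_html(..., header=[0, 1])
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("UN Region", "UN Region"),
    ("IMF[1][13]", "Estimate"),
    ("IMF[1][13]", "Year"),
])
df = pd.DataFrame([["Japan", "Asia", "4,912,147", "2022"]], columns=columns)


def flatten(top, sub):
    """Drop footnote markers from the top level; join the levels only if they differ."""
    return top if top == sub else f'{top.split("[")[0]} {sub}'


df.columns = [flatten(a, b) for a, b in df.columns]
print(list(df.columns))
```

Columns whose two levels are identical (the header cell spans both rows) keep a single name, while the others become "top sub" pairs such as IMF Estimate.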

If you also need the values as numbers, use:

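import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
# na_values=['—'] turns the dash placeholders into NaN while parsing
df = pd.read_html(url, header=[0, 1], na_values=['—'])[2]

df.columns = [(f'{a.split("[")[0]} {b}') if a!=b else a for a, b in df.columns]
obj_cols = df.select_dtypes(object).columns

# Drop leading footnote markers by keeping only the text after the last "]"
df[obj_cols] = df[obj_cols].apply(lambda x: x.str.split(']').str[-1])

# Remove thousands separators, then convert every column after the first two to numbers
df.iloc[:, 2:] = df.iloc[:, 2:].replace(',','', regex=True).apply(pd.to_numeric)
print(df.head())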
  Country/Territory UN Region  IMF Estimate  IMF Year  \
0             World       NaN    93863851.0    2021.0   
1     United States  Americas    25346805.0    2022.0   
2             China      Asia    19911593.0    2022.0   
3             Japan      Asia     4912147.0    2022.0   
4           Germany    Europe     4256540.0    2022.0   

   United Nations Estimate  United Nations Year  World Bank Estimate  \
0               87461674.0               2020.0           96100091.0   
1               20893746.0               2020.0           22996100.0   
2               14722801.0               2020.0           17734063.0   
3                5057759.0               2020.0            4937422.0   
4                3846414.0               2020.0            4223116.0   

   World Bank Year  
0           2021.0  
1           2021.0  
2           2021.0  
3           2021.0  
4           2021.0  
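
The cleaning steps can also be checked on a small hand-made frame. This is a sketch under stated assumptions: the sample rows and the "[n 1]" footnote marker are hypothetical, and it assigns by column label rather than through iloc, which is a choice made for this sketch and not part of the answer above.

```python
import pandas as pd

# Hypothetical sample rows in the shape of the flattened table
df = pd.DataFrame({
    "Country/Territory": ["United States", "Montserrat"],
    "UN Region": ["Americas", "Americas"],
    "IMF Estimate": ["[n 1]25,346,805", None],
    "IMF Year": ["2022", None],
})

num_cols = df.columns[2:]
# Keep only the text after the last "]" to drop leading footnote markers
cleaned = df[num_cols].apply(lambda s: s.str.split("]").str[-1])
# Remove thousands separators so the strings parse as numbers
cleaned = cleaned.replace(",", "", regex=True)
# Assigning by label replaces the columns, so they take the numeric dtype
df[num_cols] = cleaned.apply(pd.to_numeric)
print(df.dtypes)
```

Missing entries stay NaN, which is why the converted columns come out as floats rather than integers.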


Answered by:
jezrael


Answer checked by:
Clifford M. (FixError Volunteer)
