How Do I Avoid Data From Different Tabs To Be Concatenated In One Cell When I Scrape A Table?

I scraped this page https://www.capfriendly.com/teams/bruins, specifically looking for the tables under the tab Cap Hit (Forwards, Defense, Goaltenders). I used Python and Beautiful

Solution 1:

Based on the page source, the extra text comes from cells whose content is conditionally visible depending on which tab you're on (as your title suggests). The class .hide is added to a child element of the td when that element is meant to be hidden on the current tab.
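To make the cause concrete, here is a minimal sketch that reproduces the concatenation offline. The HTML fragment and the dollar values are hypothetical stand-ins, not copied from the real page, and html.parser is used only so the example has no extra dependency.

```python
import bs4

# Hypothetical markup: one cell holding a value for each of two tabs.
# The value for the inactive tab carries the "hide" class.
html = (
    '<table><tr><td>'
    '<span>$6,125,000</span><span class="hide">$5,000,000</span>'
    '</td></tr></table>'
)

soup = bs4.BeautifulSoup(html, 'html.parser')
td = soup.find('td')

# .text joins the text of every descendant, whether or not a browser
# would display it, so both values end up in one string.
cell_text = td.text
print(cell_text)
```

A browser applies the CSS rule for hide and shows only one value, while BeautifulSoup sees the raw markup and has no notion of visibility.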

When you parse the td elements to retrieve the text, you can filter out the elements that are supposed to be hidden. That way, you get the text that would be visible if you were viewing the page in a web browser.

In the snippet below, I added a parse_td function which filters out the child span elements with a class of hide. It returns the text of the first remaining span, or the cell's full text if the cell has no visible child span.

import csv

import bs4
import requests

r = requests.get('https://www.capfriendly.com/teams/bruins')
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="team")


def parse_td(td):
    # Keep only the direct child spans that are not marked as hidden.
    # tag.get('class', []) avoids a KeyError for spans with no class attribute.
    filtered_data = [tag.text for tag in td.find_all('span', recursive=False)
                     if 'hide' not in tag.get('class', [])]
    # Fall back to the cell's full text when it has no visible child span.
    return filtered_data[0] if filtered_data else td.text


with open("csvfile.csv", "w", newline='') as team_data:
    # Create the writer once rather than once per row.
    writer = csv.writer(team_data)
    for tr in table('tr', class_=['odd', 'even']):
        writer.writerow([parse_td(td) for td in tr('td')])
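As an alternative sketch (not the approach used above), you could remove every hidden element from the parsed tree first and then read each cell's text normally. This also covers hidden elements that are not direct span children of the td. The fragment, the player names, and the visible_rows helper are hypothetical, so the real page's markup may differ.

```python
import bs4

# Hypothetical fragment standing in for the real table.
html = """
<table id="team">
  <tr class="odd">
    <td><span>Player A</span></td>
    <td><span>$6,125,000</span><span class="hide">$5,000,000</span></td>
  </tr>
  <tr class="even">
    <td>Player B</td>
    <td><div><span class="hide">$1,000,000</span><span>$900,000</span></div></td>
  </tr>
</table>
"""


def visible_rows(markup):
    """Return the table rows as lists of visible cell text."""
    soup = bs4.BeautifulSoup(markup, 'html.parser')
    table = soup.find(id='team')
    # Delete every element carrying the "hide" class, at any depth.
    for hidden in table.find_all(class_='hide'):
        hidden.decompose()
    # With the hidden nodes gone, plain text extraction is enough.
    return [[td.get_text(strip=True) for td in tr.find_all('td')]
            for tr in table.find_all('tr', class_=['odd', 'even'])]


rows = visible_rows(html)
print(rows)
```

The trade-off is that decompose() modifies the parsed tree, so this only suits cases where you don't need the hidden values later.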
