Webscrape CNN, injection, beautiful soup, python, requests, HTML

Question

asked Jan 27, 2021 in Technique by 深蓝 (71.8m points)

Okay, I thought I was going crazy because I repeatedly failed at this, but then I figured maybe something is happening with the HTML that I don't understand.

I have been trying to scrape the 'articles' from cnn.com.

But no matter what I tried, soup.find_all('articles'), soup.find('body').div('div'), etc., with class names, ids, and so on, it FAILED.

I found this reference: Webscraping from React web application after componentDidMount.

I suspect content being injected into the HTML is why I am having issues.

I know 0 about injection other than 'html injection attacks' from cyber security reading.

I want the articles, but I am assuming I will need to use a tactic similar to the one in the Stack Overflow question referenced above. I do not know how. Links to help documents, or anything specifically about scraping CNN, would be appreciated.

Or maybe someone knows how I could get the 'full data' of the HTML body element, so that I could do some rearranging in the early part of this function and then just reassign body.

'Or just tell me I'm an idiot and on the wrong track'

import requests
from bs4 import BeautifulSoup

def build_art_d(site):
    url = site
    main_l = len(url)

    # Fetch the raw HTML; requests does not run the page's JavaScript
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')

    print(soup.prettify())

    art_dict = {}

    # Walk down the nested containers that should hold the articles
    body = soup.find('body')
    print(body.prettify())
    div1 = body.find('div', {'class':'pg-no-rail pg-wrapper'})
    section = div1.find('section',{'id' : 'homepage1-zone-1'})
    div2 = section.find('div', {'class':'l-container'})
    div3 = div2.find('div', {'class':'zn__containers'})
    articles = div3.find_all('article')

    # Map each article's text to the href of its first link, if it has one
    for art in articles:
        link = art.find('a')
        art_dict[art.text] = link.get('href') if link else None

    #test print
    for article in art_dict:
        print('Article :: {}'.format(article), 'Link :: {}'.format(art_dict[article]))

1 Answer

深蓝 · Answer 1 · 2021-01-27T04:41:28+0000

You can use Selenium to load the page in a real browser, so that the site's JavaScript fills in the data. Then use your existing bs4 code to scrape the articles from the rendered page.

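from bs4 import BeautifulSoup
from selenium import webdriver

# Load the page in a real browser so the site's JavaScript can run
driver = webdriver.Chrome()
driver.get('https://www.cnn.com/')

# Parse the rendered HTML, then close the browser
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()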
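
One thing that may help: page_source can be read before the JavaScript has finished adding content, so an explicit wait is often useful. Below is a rough sketch of that idea, not a tested CNN scraper. It assumes the rendered page still wraps stories in article tags, as the question's code expects. The helper names fetch_rendered_html and extract_article_links, and the 15-second timeout, are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_article_links(html, base_url):
    """Map each <article>'s text to the absolute URL of its first link."""
    soup = BeautifulSoup(html, 'html.parser')
    links = {}
    for article in soup.find_all('article'):
        anchor = article.find('a', href=True)
        if anchor is None:
            continue  # skip articles that contain no link
        title = article.get_text(' ', strip=True)
        if title:
            # Resolve relative hrefs such as '/2021/...' against the site URL
            links[title] = urljoin(base_url, anchor['href'])
    return links


def fetch_rendered_html(url, timeout=15):
    """Open url in Chrome and return the HTML once an <article> is present."""
    # Imported here so the parsing helper can be used without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Assumption: at least one <article> appears after the scripts run;
        # a TimeoutException is raised if none shows up within `timeout` seconds
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, 'article'))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

To try it, call extract_article_links(fetch_rendered_html('https://www.cnn.com/'), 'https://www.cnn.com/') and loop over the returned dict. If the site no longer uses article tags, the wait will time out and the tag name in both helpers would need adjusting.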
