I thought I was going crazy because I kept failing at this, but maybe something is happening with the HTML that I don't understand.
I have been trying to scrape the articles from cnn.com.
But no matter what I try, whether soup.find_all('article') or soup.find('body').div('div') and so on, with class names, ids, etc., it fails.
I found this reference: Webscraping from React web application after componentDidMount.
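I suspect the articles are injected into the HTML after the page loads (presumably by JavaScript, as in that question), and that this is why I am having issues.
I know nothing about injection other than 'HTML injection attacks' from cyber security reading.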
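One way to test that suspicion, as a minimal sketch: count the tags in the raw HTML that comes back from the server and compare the result with what the browser's inspector shows. The names `TagCounter`, `count_tags`, and the `sample` string are made up for illustration, and the sketch uses only the standard library's `html.parser`, so it does not depend on BeautifulSoup.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lowercased.
        self.counts[tag] += 1


def count_tags(html):
    """Return a Counter mapping tag name -> number of opening tags."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical stand-in for a JavaScript-rendered page: an empty mount
# point plus a script, and no <article> tags in the served markup.
sample = "<html><body><div id='root'></div><script>render()</script></body></html>"
counts = count_tags(sample)
print(counts["article"], counts["script"])  # prints: 0 1
```

Running `count_tags(requests.get(url).text)` on the real page would be the actual check. If the count for `article` is zero while the inspector shows many, the articles are most likely built in the browser and are simply not present in the text that `requests` receives.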
I want the articles, but I assume I will need a tactic similar to the one in the Stack Overflow question above, and I do not know how to do that. Links to documentation, or to anything specifically about scraping CNN, would be appreciated.
Alternatively, if someone knows how I can get the full data of the HTML body element, I could rearrange the early part of the function below and just reassign body.
Or just tell me I'm an idiot and on the wrong track.
import requests
from bs4 import BeautifulSoup

def build_art_d(site):
    url = site
    main_l = len(url)
    html = requests.get(url).text
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())
    art_dict = {}
    body = soup.find('body')
    print(body.prettify())
    # Walk down through the containers seen in the browser's inspector
    div1 = body.find('div', {'class': 'pg-no-rail pg-wrapper'})
    section = div1.find('section', {'id': 'homepage1-zone-1'})
    div2 = section.find('div', {'class': 'l-container'})
    div3 = div2.find('div', {'class': 'zn__containers'})
    articles = div3.find_all('article')
    for art in articles:
        # art.href would search for a child <href> tag; the URL is the
        # href attribute of the <a> inside the article
        link = art.find('a')
        art_dict[art.text] = link.get('href') if link else None
    # test print
    for article in art_dict:
        print('Article :: {}'.format(article), 'Link :: {}'.format(art_dict[article]))
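For the "full data" of the body, one commonly used approach is to let a real browser run the page's scripts and then read the resulting HTML. The sketch below assumes Selenium and a Chrome driver are installed. `fetch_rendered_html` is a hypothetical helper name, and whether CNN's articles are present by the time `page_source` is read is an assumption that has not been verified here; an explicit wait may be needed.

```python
def fetch_rendered_html(url, driver=None):
    """Return the page HTML after the browser has executed its JavaScript.

    If no driver is passed in, a Chrome driver is created and closed here.
    """
    own_driver = driver is None
    if own_driver:
        # Imported lazily so the function can also be exercised with a
        # stand-in driver object when Selenium is not installed.
        from selenium import webdriver
        driver = webdriver.Chrome()
    try:
        driver.get(url)
        # page_source reflects the DOM at the moment of the call, which
        # may still be missing content that loads later.
        return driver.page_source
    finally:
        if own_driver:
            driver.quit()
```

The returned string could then replace `requests.get(url).text` in `build_art_d`, so the rest of the BeautifulSoup code stays unchanged.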