Welcome to Vigges Developer Community - Open, Learning, Share
Welcome To Ask or Share your Answers For Others

python - Problem while checking internal links with BeautifulSoup and Selenium

Answer here: How to join absolute and relative URLs?

I want to check internal links with BeautifulSoup and Selenium.

The script works when a link uses the full URL path:

<a href="http...." />

The script does NOT work when a link uses a partial URL path:

<a href="/internal_link.php" />

My Python script:

import re
from bs4 import BeautifulSoup

# r (page HTML), exc (words to skip) and site (site URL) are defined earlier
soup = BeautifulSoup(r, 'html5lib')
links = []
for link in soup.find_all('a'):
    href = str(link.get('href')).lower()
    keep = True
    for word in exc:
        if word in href:
            keep = False
            break
    if keep:
        try:
            st = re.search(r'(\S+)', href)
            st = st.group(0)
            if site in st:  # 2 SCENARIOS HERE
                links.append(st)
        except AttributeError:  # no match, so st is None
            pass

CASE 1: check all links (full path):

if "http" in st:

CASE 2: check only internal links (full path; site is the current page):

if site in st:

So I'm looking for a way to collect links even when the href does not contain the full URL path.
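
One way to close that gap, sketched here with the standard library only, is to resolve every href against the address of the page it came from before applying either check. `urljoin` from `urllib.parse` does this; the helper name `resolve_links` and the example `page_url` are assumptions for illustration and are not part of the original script.

```python
from urllib.parse import urljoin

def resolve_links(page_url, hrefs):
    """Return each href as an absolute URL, resolved against page_url."""
    # Empty or missing hrefs are skipped; absolute hrefs pass through unchanged.
    return [urljoin(page_url, href) for href in hrefs if href]

# Assumed page address and hrefs, for illustration only.
page_url = "http://www.example.com/dir/page.php"
hrefs = ["/internal_link.php", "other.php", "http://www.example.com/full.php"]

resolved = resolve_links(page_url, hrefs)
print(resolved)
```

After this step every entry starts with the scheme and host, so a test such as `if site in st:` sees partial and full paths alike.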



1 Answer


A possible example, which prepends the site URL to partial paths:

from bs4 import BeautifulSoup

html = '''
<a href="/internal_link.php" />
<a href="http://www.example.com/internal_link.php" />
<a href="/internal_link.php" />
'''

url = 'http://www.example.com'

soup = BeautifulSoup(html, 'html5lib')
links = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href is None:
        continue  # anchor without an href attribute
    if 'http' not in href.lower():
        # partial path: prepend the site URL
        links.append(url + href)
    elif url in href.lower():
        # full path that already points at the site
        links.append(href)
print(links)

Output

['http://www.example.com/internal_link.php',
 'http://www.example.com/internal_link.php',
 'http://www.example.com/internal_link.php']
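
Plain string concatenation works for hrefs that start with a slash, but it can go wrong for paths such as `other.php` or for non-HTTP links such as `mailto:`. A more defensive variant, given as a minimal sketch under assumptions rather than a definitive implementation, resolves each href with `urljoin` and keeps only those whose host matches the page's host. The function name `internal_links`, its `exclude` parameter (standing in for the `exc` list from the question) and the sample hrefs are hypothetical; the hrefs could just as well come from `link.get('href')` in the loop above.

```python
from urllib.parse import urljoin, urlparse

def internal_links(page_url, hrefs, exclude=()):
    """Resolve hrefs against page_url and keep those on the same host."""
    host = urlparse(page_url).netloc.lower()
    kept = []
    for href in hrefs:
        if not href:
            continue  # missing or empty href
        if any(word in href.lower() for word in exclude):
            continue  # href contains an excluded word
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        # Keep only http(s) links that point at the same host as the page.
        if parts.scheme in ("http", "https") and parts.netloc.lower() == host:
            kept.append(absolute)
    return kept

# Sample input, assumed for illustration.
found = internal_links(
    "http://www.example.com/index.php",
    [
        "/internal_link.php",
        "http://www.example.com/about.php",
        "http://other.example.org/page.php",
        "mailto:someone@example.com",
        None,
        "/logout.php",
    ],
    exclude=["logout"],
)
print(found)
```

Comparing hosts instead of testing `site in st` avoids false matches when the site address merely appears somewhere inside another URL, for example in a query string.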
