python beautifulsoup level 1 only text

Suggestion : 1

Limit your search to direct children of the table element only by setting the recursive argument to False:

import re

table = soup.find('div', class_='right1').table
rows = table.find_all('tr', {
    "class": re.compile('list.*')
}, recursive=False)

@MartijnPieters' solution is already perfect, but don't forget that BeautifulSoup allows you to use multiple attributes as well when locating elements. See the following code:

from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

url = "http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31"
r = rq.get(url)
r.encoding = "gb2312"

# use r.text so the gb2312 encoding set above is actually applied
soup = bsoup(r.text, "html.parser")
div = soup.find("div", class_="right1")
rows = div.find_all("tr", {
    "class": re.compile(r"list\d+"),
    "style": "cursor:pointer;"
})

for row in rows:
    first_td = row.find_all("td")[0]
    print(first_td.get_text())

Notice how I also added "style":"cursor:pointer;". This is unique to the top-level rows and is not an attribute of the inner rows. This gives the same result as the accepted answer:

百度汇总
360 搜索
新搜狗
谷歌
微软必应
雅虎
0
有道
其他
   [Finished in 2.6 s]

Suggestion : 2


  • bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python. To install it, type the below command in the terminal.
pip install bs4
  • urllib: urllib is a package that collects several modules for working with URLs. It is part of the Python standard library, so it does not need to be installed separately. A short sketch combining both modules follows this list.
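
A minimal sketch of how the two modules can be combined for the question in the title: fetch a page with urllib and keep only the text that sits directly inside a tag ("level 1"), not the text nested in its children. The URL and the choice of the <body> tag are placeholders, not from the original article.

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Placeholder URL; substitute the page you actually want to scrape.
html = urlopen("http://example.com").read()
soup = BeautifulSoup(html, "html.parser")

# recursive=False restricts the search to direct children, so only the
# strings at the first nesting level of <body> are returned.
direct_strings = soup.body.find_all(string=True, recursive=False)
print("".join(direct_strings).strip())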

Suggestion : 3

  • Pass in a value for name and you'll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names don't match.
  • Any argument that's not recognized will be turned into a filter on one of a tag's attributes. If you pass in a value for an argument called id, Beautiful Soup will filter against each tag's 'id' attribute.
  • The simplest filter is a string. Pass a string to a search method and Beautiful Soup will perform a match against that exact string; for example, the string "b" finds all the <b> tags in the document.
  • If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method; a pattern like "^b" finds all the tags whose names start with the letter "b" (in this case, the <body> tag and the <b> tag).
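
A brief sketch of these filter kinds, run against a tiny stand-alone snippet rather than the html_doc defined below:

import re
from bs4 import BeautifulSoup

doc = "<body><p class='title'><b>The Dormouse's story</b></p><a id='link1'>Elsie</a></body>"
soup = BeautifulSoup(doc, "html.parser")

print(soup.find_all('b'))               # exact string: every <b> tag
print(soup.find_all(re.compile("^b")))  # regex on tag names: <body> and <b>
print(soup.find_all(id="link1"))        # unrecognized keyword: filters on the "id" attribute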

html_doc = """<html>

<head>
   <title>The Dormouse's story</title>
</head>

<body>
   <p class="title"><b>The Dormouse's story</b></p>

   <p class="story">Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
      and they lived at the bottom of a well.
   </p>

   <p class="story">...</p>
   """
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...

from bs4 import BeautifulSoup

with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

soup = BeautifulSoup("<html>a web page</html>", 'html.parser')

Suggestion : 4

  • It initially gets the №1 div, then switches twice to the next div at the same nesting level to get to №3 (a sketch of this sibling navigation follows below).
  • To locate comments in BeautifulSoup, use the text (or string in recent versions) argument, checking that the type is Comment.
  • Define a function that takes an element as its only argument. The function should return True if the argument matches.
  • BeautifulSoup has limited support for CSS selectors, but covers the most commonly used ones. Use the select() method to find multiple elements and select_one() to find a single element.
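
The first point refers to code that is not quoted here; a minimal sketch of that kind of sibling navigation, with made-up div contents:

from bs4 import BeautifulSoup

data = """
<div>No1</div>
<div>No2</div>
<div>No3</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Take the first div, then hop to the next sibling div twice to reach
# the third div at the same nesting level.
third = soup.find("div").find_next_sibling("div").find_next_sibling("div")
print(third.get_text())  # No3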

Imagine you have the following HTML:

<div>
   <label>Name:</label>
   John Smith
</div>

In this case, you can locate the label element by its text and then use the .next_sibling property:

from bs4 import BeautifulSoup

data = """
<div>
   <label>Name:</label>
   John Smith
</div>
"""

soup = BeautifulSoup(data, "html.parser")

label = soup.find("label", text="Name:")
print(label.next_sibling.strip())

Basic example:

from bs4 import BeautifulSoup

data = """
<ul>
   <li class="item">item1</li>
   <li class="item">item2</li>
   <li class="item">item3</li>
</ul>
"""

soup = BeautifulSoup(data, "html.parser")

for item in soup.select("li.item"):
    print(item.get_text())

To locate comments in BeautifulSoup, use the text (or string in the recent versions) argument checking the type to be Comment:

from bs4 import BeautifulSoup
from bs4 import Comment

data = """
<html>

<body>
   <div>
      <!-- desired text -->
   </div>
</body>

</html>
"""

soup = BeautifulSoup(data, "html.parser")
comment = soup.find(text=lambda text: isinstance(text, Comment))
print(comment)

Define a function that takes an element as its only argument. The function should return True if the argument matches.

def has_href(tag):
    '''Returns True for tags with a href attribute'''
    return bool(tag.get("href"))

soup.find_all(has_href)  # find all elements with a href attribute

# equivalent using a lambda:
soup.find_all(lambda tag: bool(tag.get("href")))