Get values from a table with BeautifulSoup in Python
I have a table from which I am extracting links and text, although I can only get one or the other. Any idea how to get both?
Essentially I need to pull the text: "text extract here"
    for tr in rows:
        cols = tr.findAll('td')
        count = len(cols)
        if len(cols) > 1:
            third_column = tr.findAll('td')[2].contents
            third_column_text = str(third_column)
            third_columnSoup = BeautifulSoup(third_column_text)
            # Issue starts here. How can I get either the text of the element
            # <td>text here</td> or the link text of
            # <a href="somewhere.html">text here</a>?
            for elm in third_columnSoup.findAll("a"):
                # print elm.text, third_columnSoup
                item = { "code": random.upper(), "name": elm.text }
                items.insert(item)
The HTML code is the following:
    <table cellpadding="2" cellspacing="0" id="listresults">
      <tbody>
        <tr class="even">
          <td colspan="4">sort results: <a href="/~/search/af.aspx?some=lol&category=all&page=0&string=&s=a" rel="nofollow" title="sort results in alphabetical order">alphabetical</a> | <strong>rank</strong> <a href="/as.asp#rank">?</a></td>
        </tr>
        <tr class="even">
          <th>aaa</th>
          <th>vvv.</th>
          <th>gdfgd</th>
          <td></td>
        </tr>
        <tr class="odd">
          <td align="right" width="32">******</td>
          <td nowrap width="60"><a href="/aaa.html" title="more info , direct link meaning...">aaa</a></td>
          <td>text extract here</td>
          <td width="24"></td>
        </tr>
        <tr class="even">
          <td align="right" width="32">******</td>
          <td nowrap width="60"><a href="/somelink.html" title="more info , direct link meaning...">aaa</a></td>
          <td><a href="http://www.fdssfdfdsa.com/aaa">text extract here</a></td>
          <td width="24">
            <a href="/~/search/google.aspx?q=lhfjl&f=a&cx=partner-pub-2259206618774155:1712475319&cof=forid:10&ie=utf-8"><img border="0" height="21" src="/~/st/i/find2.gif" width="21"></a>
          </td>
        </tr>
        <tr>
          <td width="24"></td>
        </tr>
        <tr>
          <td align="center" colspan="4" style="padding-top:6pt"><b>note:</b> have 5575 other definitions <strong><a href="http://www.ddfsadfsa.com/aaa.html">aaa</a></strong> in our database</td>
        </tr>
      </tbody>
    </table>
You can use the text property on the td element:
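    from bs4 import BeautifulSoup

    html = """here goes html"""

    soup = BeautifulSoup(html, 'html.parser')
    # Walk every row and print the text of its third <td>, if it has one.
    for tr in soup.find_all('tr'):
        columns = tr.find_all('td')
        if len(columns) > 2:
            # .text returns the cell's text whether or not it is wrapped in <a>
            print(columns[2].text)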
Prints:

    text extract here
    text extract here
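Since the question asks for both the link and the text, here is a minimal sketch of one way to collect them together. It assumes a trimmed-down copy of the table from the question, and the function name extract_third_column and the "href" key are made up for illustration; they are not part of the original code. Cells without a link simply get None for the href.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's table (header row plus two data rows).
html = """
<table id="listresults">
  <tr><th>aaa</th><th>vvv.</th><th>gdfgd</th><td></td></tr>
  <tr><td>******</td><td><a href="/aaa.html">aaa</a></td>
      <td>text extract here</td><td></td></tr>
  <tr><td>******</td><td><a href="/somelink.html">aaa</a></td>
      <td><a href="http://www.fdssfdfdsa.com/aaa">text extract here</a></td><td></td></tr>
</table>
"""

def extract_third_column(markup):
    """Hypothetical helper: return text and optional href of each third <td>."""
    soup = BeautifulSoup(markup, "html.parser")
    items = []
    for tr in soup.find_all("tr"):
        columns = tr.find_all("td")
        if len(columns) > 2:
            cell = columns[2]
            link = cell.find("a")  # None when the cell holds plain text
            items.append({
                "name": cell.get_text(strip=True),
                "href": link.get("href") if link else None,
            })
    return items

items = extract_third_column(html)
for item in items:
    print(item)
```

Using find("a") on the cell itself avoids converting the contents to a string and parsing them a second time, which is where the code in the question ran into trouble.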
Hope that helps.