html - Python and Selenium - Scrape data from multiple siblings

- September 15, 2010

Okay, I'm new to Python and, of course, to Selenium. I'm trying to scrape a page for data, work with that data in Python, and have Selenium click the links, store times, etc.

The issue I've come across is that the page isn't formatted the way I'd like. Instead of having something like this: title link1 link2 title2 link3 link4, I have this:

<tr>
    <td>title</td>
</tr>
<tr>
    <td>
        <a href>link1</a>
    </td>
</tr>
<tr>
    <td>
        <a href>link2</a>
    </td>
</tr>
<tr>
    <td>
        <a href>link3</a>
    </td>
</tr>

Here's the HTML I'm working with: http://pastebin.com/663t7mxc

What I'm trying to do is take all of the links and categorise them based on the title they come under, e.g.

title
    link 1
    link 2
title 2
    link 3
    link 4
    link 5
title 3
    link 6

and so on.

Since the links aren't children of the same tag as the title, I'm finding it impossible to do.

This is what I have so far:

from selenium import webdriver

def test():
    print("testing")
    browser = webdriver.Chrome()
    browser.get("http://urlforpage.com")
    meetings = browser.find_elements_by_xpath('/html/body/div[2]/table[2]/tbody/tr/td')
    i = 0
    for meet in meetings:
        venue = meet.get_attribute("class")
        if venue == "bold":
            print("venue: " + str(i) + " " + meet.text)
            i += 1
        elif venue == "racing-insert-linked-events nextoff-inner-wrapper nextoff-scrollable-wrapper":
            print("links")
            print(venue.href)  # stuck here: venue is only the class string, so it has no href

test()

I'm pulling the title out based on the "bold" class of the cell. The issue is, I don't know how to pull the URL and link text from the links inside the other tags.

Any help is appreciated. Thanks.

Trying to change as little of your code as possible, is this what you're after?

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

def test():
    print('testing')
    browser = webdriver.Chrome()
    browser.get('http://urlforpage.com')
    meetings = browser.find_elements_by_xpath('/html/body/div[2]/table[2]/tbody/tr/td')
    for meet in meetings:
        if meet.get_attribute('class') == 'bold':
            # Title cell
            print('venue: {venue}'.format(venue=meet.text))
        else:
            try:
                # Look for a link inside this cell
                anchor = meet.find_element_by_tag_name('a')
                print('link: {link}, text: {text}'.format(link=anchor.get_attribute('href'), text=anchor.text))
            except NoSuchElementException:
                pass  # Are you worried about cells that are neither a title (bold) nor contain an anchor?

test()
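If you want the links actually grouped under their titles rather than just printed, one possible approach is to remember the last "bold" cell you saw while walking the cells in document order. The following is only a rough sketch, not a tested scraper: `group_links_by_title`, `FakeElement`, `link` and the example.com URLs are hypothetical names made up for illustration, and the stand-in class exists only so the grouping logic can be run without a browser. It assumes the older Selenium API used above (`find_elements_by_tag_name`); recent Selenium 4 releases spell this `find_elements(By.TAG_NAME, 'a')` instead.

```python
from collections import OrderedDict


def group_links_by_title(cells):
    """Group (text, href) pairs under the most recent 'bold' title cell."""
    groups = OrderedDict()
    current = None  # title of the section we are currently inside
    for cell in cells:
        if cell.get_attribute('class') == 'bold':
            current = cell.text
            groups.setdefault(current, [])
        elif current is not None:
            # find_elements_* returns an empty list when nothing matches,
            # so no try/except is needed for cells without links.
            for anchor in cell.find_elements_by_tag_name('a'):
                groups[current].append((anchor.text, anchor.get_attribute('href')))
    return groups


class FakeElement(object):
    """Hypothetical stand-in for a Selenium WebElement (demo only)."""

    def __init__(self, text, attrs=None, anchors=()):
        self.text = text
        self._attrs = attrs or {}
        self._anchors = list(anchors)

    def get_attribute(self, name):
        return self._attrs.get(name)

    def find_elements_by_tag_name(self, tag):
        return self._anchors if tag == 'a' else []


def link(text, href):
    """Build a fake <td> cell that contains a single fake <a> element."""
    return FakeElement(text, anchors=[FakeElement(text, {'href': href})])


# Made-up cells mirroring the title/link layout from the question.
cells = [
    FakeElement('title', {'class': 'bold'}),
    link('link1', 'http://example.com/1'),
    link('link2', 'http://example.com/2'),
    FakeElement('title 2', {'class': 'bold'}),
    link('link3', 'http://example.com/3'),
]

grouped = group_links_by_title(cells)
for title, links in grouped.items():
    print(title, links)
```

With a real browser you would pass the `meetings` list from the code above in place of `cells`. Note that two titles with identical text would end up sharing one entry in this sketch.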
