regex - Regular expression to parse values from somewhat complex HTML table


I have a web page that contains a lot of tables. I cannot change the page, but I need a way to work with its data in a different application, so I need to be able to parse the page and extract the data. I am terrible with regular expressions and would appreciate any help on this. I will use the regular expression in a PHP (Laravel) application, if that's relevant for the syntax.

The web page I need to parse contains a lot of these (among other things):

<!-- post number: 10000 -->
<!-- 127.0.0.1  127.0.0.1 -->
<table class="message" cellspacing="0" cellpadding="0" border="0">
    <tr>
        <td>
            <table cellspacing="0" cellpadding="0" border="0">
                <tr>
                    <td class="tableheader2" nowrap>
                        <b>name: </b> firstname lastname
                    </td>
                    <td class="tableheader2" nowrap>
                        <a href="url.html?param=10000" target="_blank">
                            <img src="image.png" alt="alt message" border="0">
                        </a>
                        &nbsp;
                        <a href="url2.html?param2=20000">
                            <img src="image2.png" alt="alt message" border="0">
                        </a>
                        &nbsp;
                    </td>
                    <td class="tableheader2" width="100%">
                        &nbsp;
                    </td>
                </tr>
                <tr>
                    <td class="tableheader2" width=520 colspan="3">
                        <b>
                            sent:
                        </b>
                        2014-01-01 11:00:00<br>
                    </td>
                </tr>
            </table>
        </td>
    </tr>
    <tr>
        <td class="tableheader2">
            <table class="tableheader2" cellspacing=0 cellpadding=0 border=0>
                <tr>
                    <td>
                        &nbsp;
                    </td>
                    <td>
                        lorem ipsum dolor sit amet, consectetur adipisicing elit. quos, amet neque non voluptate facilis natus ullam impedit veritatis libero maiores.
                    </td>
                    <td>
                        &nbsp;
                    </td>
                </tr>
            </table>
        </td>
    </tr>
</table>
<hr align="left">

That's one of many such posts in a long row. I have edited it a bit (indents) for readability.

What I need is to be able to parse the entire page and grab all of these elements (I am using the values from the example above; they could of course be anything):

  • 10000 (from post number comment)
  • firstname lastname
  • 2014-01-01 11:00:00
  • lorem ipsum dolor sit amet, consectetur adipisicing elit. quos, amet neque non voluptate facilis natus ullam impedit veritatis libero maiores.

Any help is appreciated. I have not provided any sample code, as none of my own futile attempts are close and they would probably be counterproductive.

This kind of stuff always involves some guesswork, but DOMDocument can help:

$d = new DOMDocument;
$d->loadHTML($html);

$x = new DOMXPath($d);

foreach ($x->query('//table[@class="message"]') as $message) {
    // walk back through the siblings to find the preceding "post number" comment
    $start = $message->previousSibling;
    while ($start && !preg_match('/post number:\s*(\d+)/', $start->nodeValue, $match)) {
        $start = $start->previousSibling;
    }
    if ($start === null) {
        continue; // comment not found
    }
    $post = $match[1];

    // reset per message so values from the previous post never leak through
    $name = $sent = null;
    foreach ($x->query('tr[1]//td[@class="tableheader2"]', $message) as $hdr) {
        if (preg_match('/name:\s*(.*)/', $hdr->nodeValue, $match)) {
            $name = rtrim($match[1]); // found name
        } elseif (preg_match('/sent:\s*(.*)/', $hdr->nodeValue, $match)) {
            $sent = rtrim($match[1]); // found sent
        }
    }

    // find the description in the next row
    $desc = trim($x->query('tr[2]//table[@class="tableheader2"]/tr/td[2]', $message)->item(0)->nodeValue);

    echo "post: $post\nname: $name\nsent: $sent\ndesc: $desc\n";
}
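If you would rather hand the values to another part of the application than echo them, the same idea can be wrapped in a function. This is only a sketch under some assumptions: the function name `extractPosts` and the array keys are made up for illustration, the comment lookup uses an XPath `preceding-sibling::comment()` step instead of the manual loop, and it assumes the markup always looks like the sample in the question (lowercase `post number:`, `name:` and `sent:` labels, and a `<body>` around the posts).

```php
<?php
// Hypothetical helper (name and array keys are illustrative only):
// returns one array per <table class="message"> found in the page.
function extractPosts(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML, so keep libxml warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $posts = [];

    foreach ($xpath->query('//table[@class="message"]') as $message) {
        // Nearest preceding sibling comment that mentions "post number".
        $comment = $xpath->query(
            'preceding-sibling::comment()[contains(., "post number")][1]',
            $message
        )->item(0);
        if ($comment === null
            || !preg_match('/post number:\s*(\d+)/', $comment->nodeValue, $m)) {
            continue; // no usable comment in front of this table
        }
        $post = ['post' => $m[1], 'name' => null, 'sent' => null, 'desc' => null];

        // Name and sent date sit in header cells of the first row.
        foreach ($xpath->query('tr[1]//td[@class="tableheader2"]', $message) as $cell) {
            if (preg_match('/name:\s*(.*)/', $cell->nodeValue, $m)) {
                $post['name'] = trim($m[1]);
            } elseif (preg_match('/sent:\s*(.*)/', $cell->nodeValue, $m)) {
                $post['sent'] = trim($m[1]);
            }
        }

        // The message text is the middle cell of the second row's inner table.
        $body = $xpath->query(
            'tr[2]//table[@class="tableheader2"]/tr/td[2]',
            $message
        )->item(0);
        if ($body !== null) {
            $post['desc'] = trim($body->nodeValue);
        }

        $posts[] = $post;
    }

    return $posts;
}
```

You would call it with the raw page source, e.g. `$posts = extractPosts($html);`, and then loop over `$posts`. As with the loop version, a table that has lost its own comment would pick up the comment of an earlier post, so it is worth checking the numbers against the page.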
