Paste: html autoextractor 2nd level opt

Author: refaktor
Mode: text
Date: Fri, 27 May 2011 21:17:10
-----------
USER MARKS WHAT VALUES TO EXTRACT
<html><title>TITLE</title><body><h2>{HEADING}</h2><p>{COMPANY}<p>{ADDRESS}</html>
-----------
LOADED TEMPLATE TREE
{[root [] [html [] [head [] [title [] [text [value "TITLE"]]]] [body [] [h2 [] [text [value "{HEADING}"]]] [p [] [text [value "{COMPANY}"]]] [p [] [text [value "{ADDRESS}"]]]]]]}
-----------
FIXED TEMPLATE HTML:
<html>
    <head>
        <title>TITLE</title>
        </head>
    <body>
        <h2>{HEADING}</h2>
        <p>{COMPANY}</p>
        <p>{ADDRESS}</p>
        </body>
    </html>

-----------
BUILD PATH FROM TEMPLATE
root
 html
  head
   title
    text : TITLE
    ( path: [in "root" in "html" in "head" in "title"])
  body
   h2
    text : {HEADING}
    ( path: [in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING"])
   p
    text : {COMPANY}
    ( path: [in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING" out _ in "p" found "COMPANY"])
   p
    text : {ADDRESS}
    ( path: [in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING" out _ in "p" found "COMPANY" out _ in "p" found "ADDRESS"])
[in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING" out _ in "p" found "COMPANY" out _ in "p" found "ADDRESS" out _ out _ out _ out _]
-----------
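A minimal Python sketch of how a path like the one above could be built: a depth-first walk that emits `in` when it enters an element, `found` when it meets a text node wrapped in braces, and `out _` when it leaves. The tuple tree shape and the names `TEMPLATE`, `build_path` and `fmt` are assumptions made for illustration; the paste does not show the original implementation.

```python
import re

# Assumed tree shape: (tag, [children]); a text leaf is a plain string.
TEMPLATE = ("root", [
    ("html", [
        ("head", [("title", ["TITLE"])]),
        ("body", [
            ("h2", ["{HEADING}"]),
            ("p", ["{COMPANY}"]),
            ("p", ["{ADDRESS}"]),
        ]),
    ]),
])

# A text node is a mark only if it is exactly {NAME}.
MARK = re.compile(r"^\{(\w+)\}$")


def build_path(node, path=None):
    """Depth-first walk that records every step as (op, arg) tokens."""
    if path is None:
        path = []
    tag, children = node
    path.append(("in", tag))
    for child in children:
        if isinstance(child, str):
            match = MARK.match(child)
            if match:  # unmarked text such as TITLE leaves no token
                path.append(("found", match.group(1)))
        else:
            build_path(child, path)
    path.append(("out", "_"))
    return path


def fmt(path):
    """Render tokens in the same notation the paste uses."""
    parts = ["out _" if op == "out" else f'{op} "{arg}"' for op, arg in path]
    return "[" + " ".join(parts) + "]"


print(fmt(build_path(TEMPLATE)))
```

Under these assumptions the printed line has the same shape as the full path shown at the end of the BUILD PATH section.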
OPTIMIZE PATH TO TEXT: ADDRESS
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
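The rule behind this first-level optimization is not shown in the paste. One rule that reproduces the line above from the full path, sketched in Python as an assumption: stop at the wanted `found`, drop every other `found`, then repeatedly delete an `in X` / `out _` pair that is directly followed by another `out _` (entering and leaving a child is pointless when the parent is left right after). The names `parse` and `optimize_path` are hypothetical.

```python
def parse(text):
    """Turn 'in "a" out _ found "X"' into [("in", "a"), ("out", "_"), ("found", "X")]."""
    words = text.replace('"', "").split()
    return list(zip(words[0::2], words[1::2]))


FULL = parse(
    'in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" '
    'found "HEADING" out _ in "p" found "COMPANY" out _ in "p" found "ADDRESS" '
    "out _ out _ out _ out _"
)


def optimize_path(path, target):
    # Step 1: keep everything up to the wanted mark, drop the other marks.
    steps = []
    reached = False
    for op, arg in path:
        if op == "found":
            if arg == target:
                steps.append((op, arg))
                reached = True
                break
            continue
        steps.append((op, arg))
    if not reached:
        raise KeyError(target)

    # Step 2: remove "in X, out" pairs that are immediately followed by "out".
    changed = True
    while changed:
        changed = False
        for i in range(len(steps) - 2):
            if (steps[i][0], steps[i + 1][0], steps[i + 2][0]) == ("in", "out", "out"):
                del steps[i:i + 2]
                changed = True
                break
    return steps


print(optimize_path(FULL, "ADDRESS"))
```

Sibling visits such as `in "head" out _` survive because they are followed by `in`, not `out`; they are what positions the walker in front of the next sibling.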

==========================================
SECOND LEVEL OF PATH OPTIMIZER AND WALKER
(optimizes path for more robust extraction, not speed)
[in "root" in "html" seek-next 0 in "body" seek-next 1 in "p" found "ADDRESS"]
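How the second-level path is derived is not shown either. A guess that reproduces the line above from the first-level path, sketched in Python: every `in X` / `out _` pair is a sibling that was stepped over; when the walk finally descends into a tag after stepping over siblings, emit `seek-next N in tag`, where N is the number of skipped siblings with that same tag. Reading `seek-next N` as a 0-based "N-th child with this tag" is an assumption, and `to_seek_path` is a hypothetical name.

```python
FIRST_LEVEL = [
    ("in", "root"), ("in", "html"),
    ("in", "head"), ("out", "_"),
    ("in", "body"),
    ("in", "h2"), ("out", "_"),
    ("in", "p"), ("out", "_"),
    ("in", "p"), ("found", "ADDRESS"),
]


def to_seek_path(path):
    """Rewrite a first-level optimized path into seek-next form."""
    result = []
    skipped = {}  # tag -> number of siblings stepped over at the current level
    i = 0
    while i < len(path):
        op, arg = path[i]
        if op == "in":
            if i + 1 < len(path) and path[i + 1][0] == "out":
                # Sibling that is entered and left at once: only count it.
                skipped[arg] = skipped.get(arg, 0) + 1
                i += 2
                continue
            if skipped:
                result.append(("seek-next", skipped.get(arg, 0), arg))
            else:
                result.append(("in", arg))
            skipped = {}  # descending starts a fresh level
        elif op == "found":
            result.append(("found", arg))
        else:
            # A lone "out" should not occur in a first-level optimized path.
            raise ValueError(f"unexpected step {op!r} at position {i}")
        i += 1
    return result


print(to_seek_path(FIRST_LEVEL))
```

Because only same-tag siblings are counted, an extra `div` or `table` in front of the target no longer shifts the position, which is the robustness the heading above refers to.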

###########################################
TEST ON THE SAMPLES
-----------
A HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><h2>Fast cars</h2><p>Honda</p><p>Japan</html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
    <head>
        <title>car providers</title>
        </head>
    <body>
        <h2>Fast cars</h2>
        <p>Honda</p>
        <p>Japan</p>
        </body>
    </html>

-----------
EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
["ADDRESS" "Japan"]
ADDRESS Japan
-----------
EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Japan"]
==============
-----------
A HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><body><div>ALERT</div><h2>Fast cars</h2><p>Honda</p><p>Japan</html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [text [value "ALERT"]]] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
    <head>
        <title>car providers</title>
        </head>
    <body>
        <div>
            ALERT</div>
        <h2>Fast cars</h2>
        <p>Honda</p>
        <p>Japan</p>
        </body>
    </html>

-----------
EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree !
-----------
EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Japan"]
==============
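Why the first-level path breaks as soon as a `<div>` is added can be seen with a strict, position-based walker. The Python sketch below is an assumption about how such a walker might behave, not the original code: `in X` requires the next unvisited child element to have tag X, `out _` returns to the parent and moves one sibling forward. `el`, `page`, `walk_strict` and `PathError` are hypothetical names.

```python
class PathError(Exception):
    pass


def el(tag, *children):
    """Element as (tag, [children]); text children are plain strings."""
    return (tag, list(children))


def page(*body):
    head = el("head", el("title", "car providers"))
    return el("root", el("html", head, el("body", *body)))


SAMPLE_1 = page(el("h2", "Fast cars"), el("p", "Honda"), el("p", "Japan"))
SAMPLE_2 = page(el("div", "ALERT"), el("h2", "Fast cars"),
                el("p", "Honda"), el("p", "Japan"))

PATH = [("in", "root"), ("in", "html"), ("in", "head"), ("out", "_"),
        ("in", "body"), ("in", "h2"), ("out", "_"), ("in", "p"), ("out", "_"),
        ("in", "p"), ("found", "ADDRESS")]


def walk_strict(root, path):
    # Each stack entry is [node, index of the next child element to visit].
    stack = [[("#document", [root]), 0]]
    for op, arg in path:
        node, idx = stack[-1]
        if op == "in":
            elems = [c for c in node[1] if not isinstance(c, str)]
            if idx >= len(elems) or elems[idx][0] != arg:
                raise PathError("This path doesn't exist in this tree")
            stack.append([elems[idx], 0])
        elif op == "out":
            stack.pop()
            stack[-1][1] += 1  # continue with the next sibling
        elif op == "found":
            text = "".join(c for c in node[1] if isinstance(c, str))
            return [arg, text]
    raise PathError("path contains no found step")


print(walk_strict(SAMPLE_1, PATH))
try:
    walk_strict(SAMPLE_2, PATH)
except PathError as err:
    print("##ERROR:", err)
```

On the second sample the first child of `body` is `div`, so the step `in "h2"` has nothing to match and the walk stops there.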
-----------
A HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><body><div>ALERT</div><h2>Fast cars</h2><p>Honda</p><p>Japan<p>Additional</p></html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [text [value "ALERT"]]] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]] [p [] [text [value "Additional"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
    <head>
        <title>car providers</title>
        </head>
    <body>
        <div>
            ALERT</div>
        <h2>Fast cars</h2>
        <p>Honda</p>
        <p>Japan</p>
        <p>Additional</p>
        </body>
    </html>

-----------
EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree !
-----------
EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Japan"]
==============
-----------
A HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><body><div><p>ALERT</p></div><h2>Fast cars</h2><table><p>some stuff</p></table> <p>Honda</p><p>Japan<p>Additional</p><p>More add</p></html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [p [] [text [value "ALERT"]]]] [h2 [] [text [value "Fast cars"]]] [table []] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]] [p [] [text [value "Additional"]]] [p [] [text [value "More add"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
    <head>
        <title>car providers</title>
        </head>
    <body>
        <div>
            <p>ALERT</p>
            </div>
        <h2>Fast cars</h2>
        <table>
            </table>
        <p>Honda</p>
        <p>Japan</p>
        <p>Additional</p>
        <p>More add</p>
        </body>
    </html>

-----------
EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree !
-----------
EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Japan"]
==============
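A Python sketch of a walker for the second-level path, run against all four sample trees above. It assumes that `in "tag"` enters the first child element with that tag and that `seek-next N in "tag"` enters the N-th (0-based) child element with that tag, ignoring other siblings; these semantics are inferred from the results in the paste, and `el`, `page` and `walk_seek` are hypothetical names.

```python
def el(tag, *children):
    """Element as (tag, [children]); text children are plain strings."""
    return (tag, list(children))


def page(*body):
    head = el("head", el("title", "car providers"))
    return el("root", el("html", head, el("body", *body)))


SAMPLES = [
    page(el("h2", "Fast cars"), el("p", "Honda"), el("p", "Japan")),
    page(el("div", "ALERT"), el("h2", "Fast cars"),
         el("p", "Honda"), el("p", "Japan")),
    page(el("div", "ALERT"), el("h2", "Fast cars"),
         el("p", "Honda"), el("p", "Japan"), el("p", "Additional")),
    page(el("div", el("p", "ALERT")), el("h2", "Fast cars"), el("table"),
         el("p", "Honda"), el("p", "Japan"),
         el("p", "Additional"), el("p", "More add")),
]

SEEK_PATH = [("in", "root"), ("in", "html"), ("seek-next", 0, "body"),
             ("seek-next", 1, "p"), ("found", "ADDRESS")]


def walk_seek(root, path):
    node = ("#document", [root])
    for step in path:
        if step[0] == "found":
            text = "".join(c for c in node[1] if isinstance(c, str))
            return [step[1], text]
        # Treat ("in", tag) like ("seek-next", 0, tag).
        n, tag = (0, step[1]) if step[0] == "in" else (step[1], step[2])
        matches = [c for c in node[1]
                   if not isinstance(c, str) and c[0] == tag]
        if n >= len(matches):
            raise LookupError(f"no child {tag!r} number {n}")
        node = matches[n]
    raise LookupError("path contains no found step")


for sample in SAMPLES:
    print(walk_seek(sample, SEEK_PATH))
```

The `p` nested inside `div` in the last sample does not disturb the count because only direct children of `body` are considered.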