Paste: html autoextractor 2nd level opt
Author: refaktor
Mode: text
Date: Fri, 27 May 2011 21:28:28
-----------
USER MARKS WHAT VALUES TO EXTRACT
<html><title>TITLE</title><body><h2>{HEADING}</h2><p>{COMPANY}<p>{ADDRESS}</html>
-----------
LOADED TEMPLATE TREE
{[root [] [html [] [head [] [title [] [text [value "TITLE"]]]] [body [] [h2 [] [text [value "{HEADING}"]]] [p [] [text [value "{COMPANY}"]]] [p [] [text [value "{ADDRESS}"]]]]]]}
-----------
FIXED TEMPLATE HTML:
<html>
<head>
<title>TITLE</title>
</head>
<body>
<h2>{HEADING}</h2>
<p>{COMPANY}</p>
<p>{ADDRESS}</p>
</body>
</html>
-----------
BUILD PATH FROM TEMPLATE
root
html
head
title
text : TITLE
body
h2
text : {HEADING}
p
text : {COMPANY}
p
text : {ADDRESS}
[in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING" out _ in "p" found "COMPANY" out _ in "p" found "ADDRESS" out _ out _ out _ out _]
-----------
OPTIMIZE PATH TO TEXT: ADDRESS
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
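One way the first-level optimization could work, sketched in Python under the assumption that it (a) cuts the full path at the target's "found" step, (b) keeps the still-open ancestors, and (c) collapses every sibling subtree that was stepped over into a bare in/out pair. The step encoding and the name optimize_path are hypothetical.

```python
OUT = ("out", None)
# The full template path from the previous section, as (op, arg) pairs.
FULL_PATH = [
    ("in", "root"), ("in", "html"),
    ("in", "head"), ("in", "title"), OUT, OUT,
    ("in", "body"),
    ("in", "h2"), ("found", "HEADING"), OUT,
    ("in", "p"), ("found", "COMPANY"), OUT,
    ("in", "p"), ("found", "ADDRESS"), OUT,
    OUT, OUT, OUT,
]


def optimize_path(path, target):
    """Reduce a full path to the steps needed to reach one target value."""
    prefix = path[:path.index(("found", target))]

    # Elements still open at the cut point are the ancestors of the target.
    open_steps = []
    for i, (op, _) in enumerate(prefix):
        if op == "in":
            open_steps.append(i)
        elif op == "out":
            open_steps.pop()
    ancestors = set(open_steps)

    result = []
    depth = 0  # nesting depth inside a sibling subtree that is only stepped over
    for i, (op, arg) in enumerate(prefix):
        if i in ancestors:
            result.append((op, arg))
        elif op == "in":
            if depth == 0:
                result.append((op, arg))  # keep the sibling itself
            depth += 1
        elif op == "out":
            depth -= 1
            if depth == 0:
                result.append((op, arg))  # and its matching out
        # "found" steps for other values are dropped
    result.append(("found", target))
    return result


address_path = optimize_path(FULL_PATH, "ADDRESS")
```

With these assumptions the result for ADDRESS has the same eleven steps as the optimized path printed above.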
==========================================
SECOND LEVEL OF PATH OPTIMIZER AND WALKER
(optimizes path for more robust extraction, not speed)
[in "root" in "html" seek-next 0 in "body" seek-next 1 in "p" found "ADDRESS"]
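The log does not define seek-next. The sketch below assumes that `seek-next N in "tag"` means "among the remaining siblings, skip N elements with this tag and enter the next one", and that a seek-next is emitted only where the first-level path stepped over at least one sibling. Under that reading the second-level path can be derived from the first-level one by counting skipped siblings that share the tag of the element finally entered. The function name to_level2 is hypothetical.

```python
OUT = ("out", None)
# First-level optimized path to ADDRESS, as (op, arg) pairs.
LEVEL1 = [
    ("in", "root"), ("in", "html"), ("in", "head"), OUT,
    ("in", "body"), ("in", "h2"), OUT, ("in", "p"), OUT,
    ("in", "p"), ("found", "ADDRESS"),
]


def to_level2(path):
    """Replace runs of skipped siblings ("in X out") by a single seek-next step."""
    result = []
    skipped = []  # tags of siblings stepped over at the current level
    i = 0
    while i < len(path):
        op, arg = path[i]
        if op == "in" and i + 1 < len(path) and path[i + 1][0] == "out":
            skipped.append(arg)  # sibling that is entered and left immediately
            i += 2
            continue
        if op == "in":
            if skipped:
                # Only same-tag siblings matter; unrelated ones are ignored.
                result.append(("seek-next", skipped.count(arg)))
                skipped = []
            result.append(("in", arg))
        else:
            result.append((op, arg))
        i += 1
    return result


level2 = to_level2(LEVEL1)
```

Because siblings with other tags no longer appear in the path, an extra div or table in front of the target does not invalidate it, which matches the "more robust, not faster" note above.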
###########################################
TEST ON THE SAMPLES
-----------
AN HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><h2>Fast cars</h2><p>Honda</p><p>Japan</html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
<head>
<title>car providers</title>
</head>
<body>
<h2>Fast cars</h2>
<p>Honda</p>
<p>Japan</p>
</body>
</html>
-----------
EXTRACT VALUES FROM SAMPLE BASED ON PATH:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
["ADDRESS" "Japan"]
ADDRESS Japan
-----------
EXTRACT VALUES FROM SAMPLE BASED ON LVL2 PATH:
[in "root" in "html" seek-next 0 in "body" seek-next 1 in "p" found "ADDRESS"]
["ADDRESS" "Japan"]
==============
-----------
AN HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><body><div>ALERT</div><h2>Fast cars</h2><p>Honda</p><p>Japan</html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [text [value "ALERT"]]] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
<head>
<title>car providers</title>
</head>
<body>
<div>
ALERT</div>
<h2>Fast cars</h2>
<p>Honda</p>
<p>Japan</p>
</body>
</html>
-----------
EXTRACT VALUES FROM SAMPLE BASED ON PATH:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree !
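Why the first-level path breaks here can be seen with a small strict walker. This is a sketch, not the original walker: it assumes that every "in" step must match the very next unvisited child of the current element. The tree encoding, the helper names and PathError are made up for the example.

```python
class PathError(Exception):
    """Raised when a step of the path has no matching node in the tree."""


def el(tag, text):
    # Element with a single text child; text nodes are ("text", string).
    return (tag, [("text", text)])


def page(*body_children):
    head = ("head", [el("title", "car providers")])
    return ("root", [("html", [head, ("body", list(body_children))])])


SAMPLE_PLAIN = page(el("h2", "Fast cars"), el("p", "Honda"), el("p", "Japan"))
SAMPLE_WITH_DIV = page(el("div", "ALERT"), el("h2", "Fast cars"),
                       el("p", "Honda"), el("p", "Japan"))

OUT = ("out", None)
PATH = [("in", "root"), ("in", "html"), ("in", "head"), OUT,
        ("in", "body"), ("in", "h2"), OUT, ("in", "p"), OUT,
        ("in", "p"), ("found", "ADDRESS")]


def walk_strict(tree, path):
    """Follow the path position by position; any structural drift is an error."""
    # A virtual parent gives the first step, in "root", something to enter.
    stack = [[("", [tree]), 0]]  # entries: [node, index of next unvisited child]
    results = {}
    for op, arg in path:
        node, idx = stack[-1]
        if op == "in":
            children = node[1]
            if idx >= len(children) or children[idx][0] != arg:
                raise PathError("This path doesn't exist in this tree")
            stack[-1][1] = idx + 1
            stack.append([children[idx], 0])
        elif op == "out":
            stack.pop()
        elif op == "found":
            results[arg] = "".join(v for tag, v in node[1] if tag == "text")
    return results
```

On SAMPLE_PLAIN the walker reaches the second p and returns its text. On SAMPLE_WITH_DIV the first child of body is a div, so the step in "h2" cannot be matched and the walker raises PathError.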
-----------
EXTRACT VALUES FROM SAMPLE BASED ON LVL2 PATH:
[in "root" in "html" seek-next 0 in "body" seek-next 1 in "p" found "ADDRESS"]
["ADDRESS" "Japan"]
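A sketch of a walker for the second-level path, again in Python and again under an assumed meaning of seek-next: `seek-next N` followed by `in "tag"` scans the remaining children of the current element, skips N children with that tag, and enters the next one. Children with other tags are ignored. All names are hypothetical.

```python
def el(tag, text):
    # Element with a single text child; text nodes are ("text", string).
    return (tag, [("text", text)])


SAMPLE_WITH_DIV = ("root", [("html", [
    ("head", [el("title", "car providers")]),
    ("body", [el("div", "ALERT"), el("h2", "Fast cars"),
              el("p", "Honda"), el("p", "Japan")]),
])])

LVL2_PATH = [("in", "root"), ("in", "html"), ("seek-next", 0), ("in", "body"),
             ("seek-next", 1), ("in", "p"), ("found", "ADDRESS")]


def walk_lvl2(tree, path):
    """Walk a path in which "in" may be preceded by a seek-next count."""
    stack = [[("", [tree]), 0]]  # entries: [node, index of next unvisited child]
    results = {}
    skip = None  # pending seek-next count; None means strict positional match
    for op, arg in path:
        node, idx = stack[-1]
        if op == "seek-next":
            skip = arg
        elif op == "in":
            children = node[1]
            if skip is None:
                # Strict step: only the next child may match.
                ok = idx < len(children) and children[idx][0] == arg
                matches, wanted = ([idx] if ok else []), 0
            else:
                # Tolerant step: consider every remaining child with this tag.
                matches = [i for i in range(idx, len(children))
                           if children[i][0] == arg]
                wanted = skip
            if wanted >= len(matches):
                raise LookupError("no matching %r element" % arg)
            pos = matches[wanted]
            stack[-1][1] = pos + 1
            stack.append([children[pos], 0])
            skip = None
        elif op == "out":
            stack.pop()
        elif op == "found":
            results[arg] = "".join(v for tag, v in node[1] if tag == "text")
    return results


extracted = walk_lvl2(SAMPLE_WITH_DIV, LVL2_PATH)
```

Only direct children are counted, so a p nested inside a div does not shift the count, and trailing extra p elements after the target do not matter either.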
==============
-----------
AN HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><body><div>ALERT</div><h2>Fast cars</h2><p>Honda</p><p>Japan<p>Additional</p></html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [text [value "ALERT"]]] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]] [p [] [text [value "Additional"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
<head>
<title>car providers</title>
</head>
<body>
<div>
ALERT</div>
<h2>Fast cars</h2>
<p>Honda</p>
<p>Japan</p>
<p>Additional</p>
</body>
</html>
-----------
EXTRACT VALUES FROM SAMPLE BASED ON PATH:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree !
-----------
EXTRACT VALUES FROM SAMPLE BASED ON LVL2 PATH:
[in "root" in "html" seek-next 0 in "body" seek-next 1 in "p" found "ADDRESS"]
["ADDRESS" "Japan"]
==============
-----------
AN HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><body><div><p>ALERT</p></div><h2>Fast cars</h2><table><p>some stuff</p></table> <p>Honda</p><p>Japan<p>Additional</p><p>More add</p></html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [p [] [text [value "ALERT"]]]] [h2 [] [text [value "Fast cars"]]] [table []] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]] [p [] [text [value "Additional"]]] [p [] [text [value "More add"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
<head>
<title>car providers</title>
</head>
<body>
<div>
<p>ALERT</p>
</div>
<h2>Fast cars</h2>
<table>
</table>
<p>Honda</p>
<p>Japan</p>
<p>Additional</p>
<p>More add</p>
</body>
</html>
-----------
EXTRACT VALUES FROM SAMPLE BASED ON PATH:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree !
-----------
EXTRACT VALUES FROM SAMPLE BASED ON LVL2 PATH:
[in "root" in "html" seek-next 0 in "body" seek-next 1 in "p" found "ADDRESS"]
["ADDRESS" "Japan"]
==============