Paste: html autoextractor 2nd level opt
Author: refaktor
Mode: factor
Date: Fri, 27 May 2011 20:07:32
-----------
USER MARKS WHAT VALUES TO EXTRACT
<html><title>TITLE</title><body><h2>{HEADING}</h2><p>{COMPANY}<p>{ADDRESS}</html>
-----------
LOADED TEMPLATE TREE
{[root [] [html [] [head [] [title [] [text [value "TITLE"]]]] [body [] [h2 [] [text [value "{HEADING}"]]] [p [] [text [value "{COMPANY}"]]] [p [] [text [value "{ADDRESS}"]]]]]]}
-----------
FIXED TEMPLATE HTML:
<html>
<head>
<title>TITLE</title>
</head>
<body>
<h2>{HEADING}</h2>
<p>{COMPANY}</p>
<p>{ADDRESS}</p>
</body>
</html>
-----------
BUILD PATH FROM TEMPLATE
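A minimal Python sketch of how a path like the one traced in this section could be built from the loaded tree. The tuple encoding and the name `build_path` are assumptions made for illustration, not the code that produced this paste: an element is `(name, [children])` and a text node is `("text", value)`.

```python
import re

def build_path(node, path=None):
    """Depth-first walk that records in/out/found steps (hypothetical helper)."""
    if path is None:
        path = []
    name, content = node
    if name == "text":
        # Only text of the form {NAME} is a value the user marked for extraction.
        marker = re.fullmatch(r"\{(\w+)\}", content)
        if marker:
            path.append(("found", marker.group(1)))
        return path
    path.append(("in", name))
    for child in content:
        build_path(child, path)
    path.append(("out", None))
    return path

template = ("root", [("html", [
    ("head", [("title", [("text", "TITLE")])]),
    ("body", [
        ("h2", [("text", "{HEADING}")]),
        ("p", [("text", "{COMPANY}")]),
        ("p", [("text", "{ADDRESS}")]),
    ]),
])])

for step in build_path(template):
    print(step)
```

Plain text such as TITLE produces no step of its own, which is why the title element shows up in the path only as an in/out pair.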
root
html
head
title
text : TITLE
( path: [in "root" in "html" in "head" in "title"])
body
h2
text : {HEADING}
( path: [in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING"])
p
text : {COMPANY}
( path: [in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING" out _ in "p" found "COMPANY"])
p
text : {ADDRESS}
( path: [in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING" out _ in "p" found "COMPANY" out _ in "p" found "ADDRESS"])
[in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING" out _ in "p" found "COMPANY" out _ in "p" found "ADDRESS" out _ out _ out _ out _]
-----------
OPTIMIZE PATH TO TEXT: ADDRESS
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
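The reduction from the full path to the per-value path above can be sketched as follows. This is an assumption about the rule, inferred from the two paths in this paste: cut the path at the target, drop the other `found` markers, and collapse every sibling subtree that was entered and left again into a single in/out pair. The name `optimize_path` is hypothetical.

```python
def optimize_path(path, target):
    """Reduce a full template path to the steps needed to reach one target."""
    end = path.index(("found", target))
    steps = [s for s in path[:end] if s[0] != "found"]

    # 'in' steps that are never closed before the target are its ancestors.
    open_stack = []
    for i, (op, _) in enumerate(steps):
        if op == "in":
            open_stack.append(i)
        else:
            open_stack.pop()
    ancestors = set(open_stack)

    result = []
    depth = 0  # nesting depth inside a closed sibling subtree
    for i, step in enumerate(steps):
        if i in ancestors:
            result.append(step)
        elif step[0] == "in":
            if depth == 0:
                result.append(step)   # keep only the subtree's outermost 'in'
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                result.append(step)   # and its matching 'out'
    result.append(("found", target))
    return result

full_path = [
    ("in", "root"), ("in", "html"), ("in", "head"), ("in", "title"),
    ("out", None), ("out", None), ("in", "body"), ("in", "h2"),
    ("found", "HEADING"), ("out", None), ("in", "p"), ("found", "COMPANY"),
    ("out", None), ("in", "p"), ("found", "ADDRESS"),
    ("out", None), ("out", None), ("out", None), ("out", None),
]

print(optimize_path(full_path, "ADDRESS"))
```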
==========================================
SECOND LEVEL OF PATH OPTIMIZER AND WALKER
(optimizes path for more robust extraction, not speed)
[in "root" in "html" seek-next 0 in "body" seek-next 1 in "p" found "ADDRESS"]
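One possible reading of this rewrite, sketched in Python. The meaning of the number after `seek-next` is not stated in the paste; the sketch assumes it counts the skipped siblings that share the name of the element entered next, which reproduces the path above for this one example. The name `second_level` is hypothetical.

```python
def second_level(path):
    """Replace skipped sibling in/out pairs with a seek-next step (assumed rule)."""
    result = []
    skipped = []  # names of siblings entered and immediately left at this level
    i = 0
    while i < len(path):
        op, arg = path[i]
        if op == "in" and i + 1 < len(path) and path[i + 1][0] == "out":
            skipped.append(arg)   # a sibling we only pass over
            i += 2
            continue
        if op == "in":
            if skipped:
                # Assumption: N = how many skipped siblings have the same name.
                result.append(("seek-next", skipped.count(arg)))
            skipped = []
        result.append((op, arg))
        i += 1
    return result

first_level = [
    ("in", "root"), ("in", "html"), ("in", "head"), ("out", None),
    ("in", "body"), ("in", "h2"), ("out", None), ("in", "p"), ("out", None),
    ("in", "p"), ("found", "ADDRESS"),
]

print(second_level(first_level))
```

Because the rewritten path no longer names the siblings it passes over, a walker following it does not depend on which unrelated elements precede the target, which fits the stated goal of robustness rather than speed.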
###########################################
TEST ON THE SAMPLES
-----------
AN HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><h2>Fast cars</h2><p>Honda</p><p>Japan</html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
<head>
<title>car providers</title>
</head>
<body>
<h2>Fast cars</h2>
<p>Honda</p>
<p>Japan</p>
</body>
</html>
-----------
EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
["ADDRESS" "Japan"]
ADDRESS Japan
-----------
EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Honda"]
==============
-----------
AN HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><body><div>ALERT</div><h2>Fast cars</h2><p>Honda</p><p>Japan</html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [text [value "ALERT"]]] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
<head>
<title>car providers</title>
</head>
<body>
<div>
ALERT</div>
<h2>Fast cars</h2>
<p>Honda</p>
<p>Japan</p>
</body>
</html>
-----------
EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree
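The error above is what a strictly positional walker would produce: each `in` must match the very next child, so the extra `<div>` in front of `<h2>` breaks the path. A minimal sketch, assuming that walker behaviour and a tuple tree encoding where an element is `(name, [children])` and a text node is `("text", value)`; `walk` and `PathError` are hypothetical names.

```python
class PathError(Exception):
    pass

def walk(tree, path):
    """Follow a first-level path; every step must match the next child exactly."""
    # Each frame is [list of sibling nodes, index of the next unvisited sibling].
    frames = [[[tree], 0]]
    results = []
    for op, arg in path:
        siblings, i = frames[-1]
        if op == "in":
            if i >= len(siblings) or siblings[i][0] != arg:
                raise PathError("This path doesn't exist in this tree")
            frames[-1][1] = i + 1
            frames.append([siblings[i][1], 0])
        elif op == "out":
            frames.pop()
        elif op == "found":
            if i >= len(siblings) or siblings[i][0] != "text":
                raise PathError("This path doesn't exist in this tree")
            results.append((arg, siblings[i][1]))
    return results

def page(body_children):
    # Builds a sample tree with the fixed head used by every sample in this paste.
    head = ("head", [("title", [("text", "car providers")])])
    return ("root", [("html", [head, ("body", body_children)])])

h2 = ("h2", [("text", "Fast cars")])
honda = ("p", [("text", "Honda")])
japan = ("p", [("text", "Japan")])
alert = ("div", [("text", "ALERT")])

path = [
    ("in", "root"), ("in", "html"), ("in", "head"), ("out", None),
    ("in", "body"), ("in", "h2"), ("out", None), ("in", "p"), ("out", None),
    ("in", "p"), ("found", "ADDRESS"),
]

print(walk(page([h2, honda, japan]), path))
try:
    walk(page([alert, h2, honda, japan]), path)
except PathError as err:
    print("##ERROR:", err)
```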
-----------
EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Honda"]
==============
-----------
AN HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><body><div>ALERT</div><h2>Fast cars</h2><p>Honda</p><p>Japan<p>Additional</p></html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [text [value "ALERT"]]] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]] [p [] [text [value "Additional"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
<head>
<title>car providers</title>
</head>
<body>
<div>
ALERT</div>
<h2>Fast cars</h2>
<p>Honda</p>
<p>Japan</p>
<p>Additional</p>
</body>
</html>
-----------
EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree
-----------
EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Honda"]
==============
-----------
AN HTML PAGE TO EXTRACT FROM
<html><title>car providers</title><body><div>ALERT</div><h2>Fast cars</h2><table><p>some stuff</p></table> <p>Honda</p><p>Japan<p>Additional</p></html>
-----------
LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [text [value "ALERT"]]] [h2 [] [text [value "Fast cars"]]] [table []] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]] [p [] [text [value "Additional"]]]]]]}
-----------
FIXED SAMPLE HTML:
<html>
<head>
<title>car providers</title>
</head>
<body>
<div>
ALERT</div>
<h2>Fast cars</h2>
<table>
</table>
<p>Honda</p>
<p>Japan</p>
<p>Additional</p>
</body>
</html>
-----------
EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree
-----------
EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Honda"]
==============