----------- USER MARKS WHAT VALUES TO EXTRACT
{COMPANY}
{ADDRESS}
----------- LOADED TEMPLATE TREE
{[root [] [html [] [head [] [title [] [text [value "TITLE"]]]] [body [] [h2 [] [text [value "{HEADING}"]]] [p [] [text [value "{COMPANY}"]]] [p [] [text [value "{ADDRESS}"]]]]]]}
----------- FIXED TEMPLATE HTML:
{COMPANY}
{ADDRESS}
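
The template tree above can be walked depth-first to record a trace of in / out / found steps. The following is a minimal sketch of that idea, not the original tool's code: the tuple-based tree model, the name `build_trace`, and the rule that only `{NAME}` texts produce a `found` step are assumptions made for illustration.

```python
# Assumed model: a node is (tag, children); a text node is ("text", "string").
TEMPLATE = ("root", [
    ("html", [
        ("head", [("title", [("text", "TITLE")])]),
        ("body", [
            ("h2", [("text", "{HEADING}")]),
            ("p", [("text", "{COMPANY}")]),
            ("p", [("text", "{ADDRESS}")]),
        ]),
    ]),
])


def build_trace(node, trace=None):
    """Depth-first walk that records in/out/found steps (hypothetical helper)."""
    if trace is None:
        trace = []
    tag, body = node
    if tag == "text":
        # Only placeholders such as "{ADDRESS}" are recorded; plain text is skipped.
        if body.startswith("{") and body.endswith("}"):
            trace.append(("found", body[1:-1]))
        return trace
    trace.append(("in", tag))
    for child in body:
        build_trace(child, trace)
    trace.append(("out", None))
    return trace


for step in build_trace(TEMPLATE):
    print(step)
```

Under these assumptions the trace has the same shape as the full path printed in the BUILD PATH step: four `in` steps down to `title`, two `out` steps, then `body` with one `found` per placeholder.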
----------- BUILD PATH FROM TEMPLATE
root html head title text : TITLE ( path: [in "root" in "html" in "head" in "title"])
body h2 text : {HEADING} ( path: [in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING"])
p text : {COMPANY} ( path: [in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING" out _ in "p" found "COMPANY"])
p text : {ADDRESS} ( path: [in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING" out _ in "p" found "COMPANY" out _ in "p" found "ADDRESS"])
[in "root" in "html" in "head" in "title" out _ out _ in "body" in "h2" found "HEADING" out _ in "p" found "COMPANY" out _ in "p" found "ADDRESS" out _ out _ out _ out _]
----------- OPTIMIZE PATH TO TEXT: ADDRESS
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
==========================================
SECOND LEVEL OF PATH OPTIMIZER AND WALKER (optimizes path for more robust extraction, not speed)
[in "root" in "html" seek-next 0 in "body" seek-next 1 in "p" found "ADDRESS"]
###########################################
TEST ON THE SAMPLES
----------- A HTML PAGE TO EXTRACT FROM
Honda
Japan
----------- LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]]]]]}
----------- FIXED SAMPLE HTML:
Honda
Japan
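
The first-level optimization shown in the OPTIMIZE PATH TO TEXT: ADDRESS step cuts the trace at the wanted placeholder, drops the other `found` steps, and collapses every subtree that was already left into a bare in/out pair. A minimal sketch of one way to get that result follows; `optimize_path` and the tuple encoding of steps are illustrative assumptions, not the original implementation.

```python
# Full trace as printed by the BUILD PATH step, encoded as (op, arg) tuples.
FULL_TRACE = [
    ("in", "root"), ("in", "html"), ("in", "head"), ("in", "title"),
    ("out", None), ("out", None),
    ("in", "body"), ("in", "h2"), ("found", "HEADING"), ("out", None),
    ("in", "p"), ("found", "COMPANY"), ("out", None),
    ("in", "p"), ("found", "ADDRESS"),
    ("out", None), ("out", None), ("out", None), ("out", None),
]


def optimize_path(trace, target):
    """Shorten a trace to the steps needed to reach `target` (hypothetical helper)."""
    path = []    # optimized steps collected so far
    opened = []  # positions in `path` of the "in" steps that are still open
    for op, arg in trace:
        if op == "in":
            opened.append(len(path))
            path.append((op, arg))
        elif op == "out":
            # The subtree is closed: forget everything that happened inside it.
            start = opened.pop()
            del path[start + 1:]
            path.append((op, arg))
        elif op == "found" and arg == target:
            path.append((op, arg))
            return path
        # Any other "found" step is irrelevant for this target and is skipped.
    raise LookupError(f"placeholder {target!r} is not in the trace")


print(optimize_path(FULL_TRACE, "ADDRESS"))
```

For `"ADDRESS"` this yields the same eleven steps as the optimized path in the log: `head` is entered and left without visiting `title`, and `h2` and the first `p` are each reduced to an in/out pair.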
----------- EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
["ADDRESS" "Japan"]
ADDRESS Japan
----------- EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Honda"]
==============
----------- A HTML PAGE TO EXTRACT FROM
Honda
Japan
----------- LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [text [value "ALERT"]]] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]]]]]}
----------- FIXED SAMPLE HTML:
Honda
Japan
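
This sample differs from the first one only by a `div` placed before the `h2`. A strict walker for the first-level path is sensitive to exactly that kind of change, because each `in` step must match the very next child. The sketch below illustrates this under assumptions: the tree model, `page`, and `walk_strict` are hypothetical names, and reading `in` as "enter the next unvisited child, which must carry this tag" is an interpretation of the log rather than documented behaviour.

```python
def page(*body):
    """Build a sample tree shaped like the ones in the log (hypothetical helper)."""
    return ("root", [("html", [
        ("head", [("title", [("text", "car providers")])]),
        ("body", list(body)),
    ])])


H2 = ("h2", [("text", "Fast cars")])
HONDA = ("p", [("text", "Honda")])
JAPAN = ("p", [("text", "Japan")])
SAMPLE_1 = page(H2, HONDA, JAPAN)
SAMPLE_2 = page(("div", [("text", "ALERT")]), H2, HONDA, JAPAN)

PATH = [
    ("in", "root"), ("in", "html"), ("in", "head"), ("out", None),
    ("in", "body"), ("in", "h2"), ("out", None),
    ("in", "p"), ("out", None), ("in", "p"), ("found", "ADDRESS"),
]


def walk_strict(tree, path):
    """Follow `path` step by step; every "in" must match the next child exactly."""
    # A virtual parent lets the first `in "root"` step be handled like any other.
    stack = [[("", [tree]), 0]]  # entries are [node, index of next child to visit]
    for op, arg in path:
        node, idx = stack[-1]
        children = node[1]
        if op == "in":
            if idx >= len(children) or children[idx][0] != arg:
                raise LookupError("This path doesn't exist in this tree")
            stack[-1][1] = idx + 1
            stack.append([children[idx], 0])
        elif op == "out":
            stack.pop()
        elif op == "found":
            if idx >= len(children) or children[idx][0] != "text":
                raise LookupError("This path doesn't exist in this tree")
            return (arg, children[idx][1])
    raise LookupError("path has no found step")


print(walk_strict(SAMPLE_1, PATH))
try:
    walk_strict(SAMPLE_2, PATH)
except LookupError as err:
    print("##ERROR:", err)
```

On the first sample the walker reaches the second `p`; on the tree with the leading `div` it stops at `in "h2"`, since the first child of `body` is no longer an `h2`.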
----------- EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree !
----------- EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Honda"]
==============
----------- A HTML PAGE TO EXTRACT FROM
Honda
Japan
Additional
----------- LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [text [value "ALERT"]]] [h2 [] [text [value "Fast cars"]]] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]] [p [] [text [value "Additional"]]]]]]}
----------- FIXED SAMPLE HTML:
Honda
Japan
Additional
----------- EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree !
----------- EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Honda"]
==============
----------- A HTML PAGE TO EXTRACT FROM
Honda
Japan
Additional
----------- LOADED SAMPLE TREE
{[root [] [html [] [head [] [title [] [text [value "car providers"]]]] [body [] [div [] [text [value "ALERT"]]] [h2 [] [text [value "Fast cars"]]] [table []] [p [] [text [value "Honda"]]] [p [] [text [value "Japan"]]] [p [] [text [value "Additional"]]]]]]}
----------- FIXED SAMPLE HTML:
Honda
Japan
Additional
----------- EXTRACT VALUES FROM SAMPLE BASED ON TEMPLATE:
[in "root" in "html" in "head" out _ in "body" in "h2" out _ in "p" out _ in "p" found "ADDRESS"]
##ERROR: This path doesn't exist in this tree !
----------- EXTRACT VALUES FROM SAMPLE BASED ON 2LVL OPTIMISED TEMPLATE:
["ADDRESS" "Honda"]
==============
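
The second-level path replaces exact sibling positions with `seek-next` steps, which is why it still resolves on trees where extra `div` or `table` elements break the first-level path. The log does not define `seek-next`, so the sketch below rests on an assumed reading: `seek-next N in "tag"` enters the child with that tag after skipping N earlier siblings of the same tag. `walk_seek` and the tree model are hypothetical.

```python
# Last sample from the log: a div, an h2, an empty table, then three p elements.
SAMPLE_4 = ("root", [("html", [
    ("head", [("title", [("text", "car providers")])]),
    ("body", [
        ("div", [("text", "ALERT")]),
        ("h2", [("text", "Fast cars")]),
        ("table", []),
        ("p", [("text", "Honda")]),
        ("p", [("text", "Japan")]),
        ("p", [("text", "Additional")]),
    ]),
])])

PATH_2 = [
    ("in", "root"), ("in", "html"),
    ("seek-next", 0), ("in", "body"),
    ("seek-next", 1), ("in", "p"),
    ("found", "ADDRESS"),
]


def walk_seek(tree, path):
    """Follow a second-level path, matching children by tag instead of position."""
    node = ("", [tree])  # virtual parent so `in "root"` works like any other step
    skip = 0             # same-tag siblings to skip at the next "in" (assumed meaning)
    for op, arg in path:
        if op == "seek-next":
            skip = arg
        elif op == "in":
            matches = [child for child in node[1] if child[0] == arg]
            if skip >= len(matches):
                raise LookupError("This path doesn't exist in this tree")
            node = matches[skip]
            skip = 0
        elif op == "found":
            texts = [child[1] for child in node[1] if child[0] == "text"]
            if not texts:
                raise LookupError("no text under the matched element")
            return (arg, texts[0])
    raise LookupError("path has no found step")


print(walk_seek(SAMPLE_4, PATH_2))
```

Under this reading the walker lands on the second `p`, so it returns "Japan" for this tree. The log reports "Honda" for the second-level path on every sample, so the original tool may treat the count differently from what is assumed here.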