NAME
    HTML::Untemplate - web scraping assistant
VERSION
    version 0.019
DESCRIPTION
    Suppose you have a set of HTML documents generated by populating the
    same template with the data from some kind of database.
    HTML::Untemplate is a set of command-line tools ("xpathify",
    "untemplate") and modules (HTML::Linear and its dependencies) which
    assist in retrieving the original data.
    This process is also known as wrapper induction.
    To achieve this goal, HTML tree nodes are presented as XPath/content
    pairs. HTML documents linearized this way can be easily inspected
    manually or with a diff tool. Please refer to "EXAMPLES".
    Despite being named similarly to HTML::Template, this distribution is
    not directly related to it. Instead, it attempts to reverse the
    templating action, whatever the template agent used.
 Why?
    Suppose you have a CMS. A typical CMS works roughly like this (data
    flows top-down):
                RDBMS
          scripting language
                 HTML
             HTTP server
                (...)
              HTTP agent
            layout engine
                screen
                 user
    Consider the first 3 steps: RDBMS => scripting language => HTML
    This is "applying template".
    Now, consider this: HTML => scripting language => RDBMS
    I would call that "un-applying template", or "untemplate" :)
    The practical application of this set of tools is to assist in the
    creation of web scrapers.
    A similar (though completely unrelated) approach is described in the
    paper XPath-Wrapper Induction for Data Extraction.
 Human-readability
    Consider the following HTML node address representations:
      * 0.1.3.0.0.4.0.0.0.2 (HTML::TreeBuilder internal address
      representation);
      * /html/body/div[4]/div/div[1]/table[2]/tr/td/ul/li[3] (HTML::Linear,
      strict);
      * //td[1]/ul[1]/li[3] (HTML::Linear, strict, shrink);
      * /html/body[@class='section_home']/div[@id='content_holder'][1]/div[@id='content']/div[@id='main']/table[@class='content_table'][2]/tr/td/ul/li[@class='rss_content rss_content_col'][2]
      (HTML::Linear, non-strict);
      * //li[@class='rss_content rss_content_col'][2] (HTML::Linear,
      non-strict, shrink).
    They all point to the same node; however, their verbosity and
    readability vary. Strict mode specifies tag names and positions only.
    Disabling strict mode adds data used by CSS selectors (such as the id
    and class attributes in the examples above). Shrink mode attempts to
    find the shortest XPath that is unique for every node (/html/body is
    shared among almost all nodes, and is thus likely to be irrelevant).
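
    The idea behind shrink mode can be illustrated with a minimal sketch.
    It is not the HTML::Linear implementation: the shrink_paths helper is
    hypothetical, it assumes strict-style paths (no "/" inside predicates),
    and it treats a path as "unique" when no other listed path ends with
    the same steps.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Linear): for every absolute path,
# return the shortest trailing run of steps that no other path ends with.
sub shrink_paths {
    my @paths = @_;
    my @steps = map { [ grep { length $_ } split m{/}, $_ ] } @paths;

    # Count how many paths end with each possible suffix.
    my %seen;
    for my $s (@steps) {
        for my $n (1 .. @$s) {
            my $suffix = join '/', @{$s}[ -$n .. -1 ];
            $seen{$suffix}++;
        }
    }

    my %short;
    for my $i (0 .. $#paths) {
        my $s = $steps[$i];
        $short{ $paths[$i] } = $paths[$i];    # fall back to the full path
        for my $n (1 .. @$s) {
            my $suffix = join '/', @{$s}[ -$n .. -1 ];
            next if $seen{$suffix} > 1;
            # A suffix that covers the whole path stays absolute.
            $short{ $paths[$i] } = $n == @$s ? "/$suffix" : "//$suffix";
            last;
        }
    }
    return \%short;
}

my $short = shrink_paths(
    '/html/head/title',
    '/html/body/h1',
    '/html/body/p[1]',
    '/html/body/div/p[1]',
);
print "$_ => $short->{$_}\n" for sort keys %$short;
```

    In this sample, the two paths ending in p[1] need one extra step to
    tell them apart, while the other two shrink to a single step.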
EXAMPLES
 xpathify
    The xpathify tool flattens the HTML tree into a key/value list:
        
        
            
                Hello HTML
            
            
                Hello World!
                This is a sample HTML
                Beware!
                HTML is not XML!
                Have a nice day.
            
        
    Becomes:
    (HTML block)
    The keys are in XPath format, while the values are respective content
    from the HTML tree. Theoretically, it could be possible to reassemble
    the HTML tree from the flat key/value list this tool generates.
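
    The flattening described above can be sketched in a few lines. This is
    an illustration only, not the code behind xpathify: the tree is
    hand-built rather than parsed from HTML, the flatten helper is
    hypothetical, and the "/text()" key suffix is an assumption about how
    text nodes could be addressed.

```perl
use strict;
use warnings;

# Toy tree: an element is [ tag, child, child, ... ]; a plain string is text.
# Hand-built here; a real tool would get the tree from an HTML parser.
my $tree =
  [ html =>
      [ head => [ title => 'Hello HTML' ] ],
      [ body =>
          [ h1 => 'Hello World!' ],
          [ p  => 'first' ],
          [ p  => 'second' ],
      ],
  ];

# Hypothetical helper: walk the tree depth-first and return a list of
# [ xpath, text ] pairs, one per text node. $path addresses $node itself.
sub flatten {
    my ($node, $path) = @_;
    my (undef, @children) = @$node;
    my @pairs;

    # Count same-tag element siblings so an index is emitted only when needed.
    my %total;
    $total{ $_->[0] }++ for grep { ref $_ } @children;

    my %nth;
    for my $child (@children) {
        if (ref $child) {
            my $name = $child->[0];
            my $step = $total{$name} > 1
                ? $name . '[' . ++$nth{$name} . ']'
                : $name;
            push @pairs, flatten($child, "$path/$step");
        }
        else {
            push @pairs, [ "$path/text()", $child ];
        }
    }
    return @pairs;
}

my @pairs = flatten($tree, '/' . $tree->[0]);
print "$_->[0]\t$_->[1]\n" for @pairs;
```

    The result is kept as an ordered list of pairs rather than a hash, so
    that document order survives and repeated keys do not overwrite each
    other.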
 untemplate
    The untemplate tool flattens a set of HTML documents using the
    algorithm from xpathify. Then, it strips the key/value pairs shared by
    all documents. The "rest" is composed of the original values fed into
    the template engine.
    This is what the result actually looks like for some simple real-world
    examples (quotes 1839 and 2486 from bash.org):
    (HTML block)
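
    The stripping step that untemplate performs can be sketched as follows.
    This is a simplified illustration, not the tool's own code: the
    strip_shared helper and the sample data are hypothetical, and each
    document is assumed to be already flattened into [ xpath, value ]
    pairs.

```perl
use strict;
use warnings;

# Hypothetical helper: given several documents, each a reference to a list
# of [ xpath, value ] pairs, drop every pair that occurs in all of them.
sub strip_shared {
    my @docs = @_;

    # Count in how many documents each distinct pair occurs.
    my %count;
    for my $doc (@docs) {
        my %in_doc;
        $in_doc{ join "\0", @$_ } = 1 for @$doc;
        $count{$_}++ for keys %in_doc;
    }

    # Keep only the pairs that are missing from at least one document.
    my @rest;
    for my $doc (@docs) {
        push @rest, [ grep { $count{ join "\0", @$_ } < @docs } @$doc ];
    }
    return @rest;
}

# Made-up sample data standing in for two pages built from one template.
my @doc_a = (
    [ '/html/head/title/text()', 'Quote database' ],
    [ '/html/body/h1/text()',    'Quote #1' ],
    [ '/html/body/p/text()',     'first quote text' ],
);
my @doc_b = (
    [ '/html/head/title/text()', 'Quote database' ],
    [ '/html/body/h1/text()',    'Quote #2' ],
    [ '/html/body/p/text()',     'second quote text' ],
);

my ($rest_a, $rest_b) = strip_shared(\@doc_a, \@doc_b);
print "$_->[0]\t$_->[1]\n" for @$rest_a, @$rest_b;
```

    The shared title pair disappears from both documents, leaving only the
    values that differ between the pages.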
MODULES
    The following modules may be used to serialize/flatten HTML documents
    on your own:
      * HTML::Linear - represent HTML::Tree as a flat list
      * HTML::Linear::Element - represent elements to populate HTML::Linear
      * HTML::Linear::Path - represent paths inside HTML::Tree
REFERENCES
      * Wrapper (data mining)
      * XPath-Wrapper Induction for Data Extraction
      * Extracting Data from HTML Using TreeBuilder Node IDs
      * Web Scraping Made Simple with SiteScraper
SEE ALSO
      * HTML::Similarity
      * Template::Extract
      * XML::DifferenceMarkup
      * XML::XSH2
AUTHOR
    Stanislaw Pusep 
COPYRIGHT AND LICENSE
    This software is copyright (c) 2014 by Stanislaw Pusep.
    This is free software; you can redistribute it and/or modify it under
    the same terms as the Perl 5 programming language system itself.