polyglot.htmlCleaner (class)

class polyglot.htmlCleaner(log, settings, url, outputDirectory=False, title=False, style=True, metadata=True, h1=True)

A parser/cleaner that strips the cruft from a webpage article and presents the remaining content cleanly, optionally styled with polyglot’s CSS.

Key Arguments:
  • log – logger
  • settings – the settings dictionary
  • url – the URL to the HTML page to parse and clean
  • outputDirectory – path to the directory in which to save the output HTML file. Default False.
  • title – title of the document to save. If False, the title of the HTML page is used as the filename. Default False.
  • style – add polyglot’s styling to the HTML document. Default True
  • metadata – include metadata in generated HTML. Default True
  • h1 – include title as H1 at the top of the doc. Default True
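
The title fallback described above can be illustrated with a minimal sketch. This is not polyglot's implementation: `choose_filename` and `_TitleParser` are hypothetical names, and both the character substitution rule and the `.html` extension are assumptions made for the example.

```python
import re
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Hypothetical helper: collect the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # only start collecting if no title has been captured yet
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def choose_filename(html, title=False):
    """Use an explicit title if given, otherwise fall back to the page <title>."""
    if title is False:
        parser = _TitleParser()
        parser.feed(html)
        title = parser.title.strip()
    # assumption: replace runs of filename-unfriendly characters with "_"
    safe = re.sub(r"[^\w\-]+", "_", title).strip("_")
    return safe + ".html"


page = "<html><head><title>My Page</title></head><body></body></html>"
print(choose_filename(page))                        # → My_Page.html
print(choose_filename(page, title="my_clean_doc"))  # → my_clean_doc.html
```

The real class may name and sanitise its output file differently; the sketch only shows the "explicit title wins, page title otherwise" decision.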

Usage:
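
The examples in this section assume that `log` and `settings` already exist. A minimal way to create stand-ins with the standard library might look like the sketch below. The logger name is arbitrary, and the empty dictionary is only a placeholder: the keys polyglot reads from `settings` are not listed in this section, so an empty dictionary may not be sufficient in practice.

```python
import logging

# a basic console logger; the name "polyglot-example" is arbitrary
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("polyglot-example")

# placeholder settings dictionary (assumption: the required keys are not documented here)
settings = {}
```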

To generate the HTML page, using the title of the webpage as the filename:

from polyglot import htmlCleaner
cleaner = htmlCleaner(
    log=log,
    settings=settings,
    url="http://www.thespacedoctor.co.uk/blog/2016/09/26/mysqlSucker-index.html",
    outputDirectory="/tmp"
)
cleaner.clean()

Or specify the title of the document and turn off the styling, metadata and H1 title:

from polyglot import htmlCleaner
cleaner = htmlCleaner(
    log=log,
    settings=settings,
    url="http://www.thespacedoctor.co.uk/blog/2016/09/26/mysqlSucker-index.html",
    outputDirectory="/tmp",
    title="my_clean_doc",
    style=False,
    metadata=False,
    h1=False
)
cleaner.clean()

__init__(log, settings, url, outputDirectory=False, title=False, style=True, metadata=True, h1=True)

Methods

  • __init__(log, settings, url[, ...])
  • clean() – parse and clean the HTML document with Mercury Parser