polyglot.htmlCleaner (class)

class polyglot.htmlCleaner(log, settings, url, outputDirectory=False, title=False, style=True, metadata=True, h1=True)

A parser/cleaner that strips a webpage article of all cruft and presents it neatly, optionally with polyglot's CSS styling.

Key Arguments:

log – logger
settings – the settings dictionary
url – the URL to the HTML page to parse and clean
outputDirectory – path to the directory in which to save the output HTML file. Default False
title – title of the document to save. If False, the title of the HTML page is used as the filename. Default False
style – add polyglot’s styling to the HTML document. Default True
metadata – include metadata in the generated HTML. Default True
h1 – include the title as an H1 heading at the top of the document. Default True

Usage:

To generate the HTML page, using the title of the webpage as the filename:

from polyglot import htmlCleaner
cleaner = htmlCleaner(
    log=log,
    settings=settings,
    url="http://www.thespacedoctor.co.uk/blog/2016/09/26/mysqlSucker-index.html",
    outputDirectory="/tmp"
)
cleaner.clean()  

Or specify the title of the document and omit the styling, metadata and H1 heading:

from polyglot import htmlCleaner
cleaner = htmlCleaner(
    log=log,
    settings=settings,
    url="http://www.thespacedoctor.co.uk/blog/2016/09/26/mysqlSucker-index.html",
    outputDirectory="/tmp",
    title="my_clean_doc",
    style=False,
    metadata=False,
    h1=False
)
cleaner.clean() 
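
Both examples above assume that `log` and `settings` already exist. As a minimal sketch of one way to prepare the arguments, the helper below collects them into a dictionary that can be unpacked into `htmlCleaner`. The helper name `make_cleaner_kwargs`, its `plain` flag, the logger name and the directory check are all assumptions made for illustration and are not part of polyglot; the contents that `settings` needs are not described on this page, so it is passed through untouched.

```python
import logging
import os


def make_cleaner_kwargs(url, output_directory, settings, title=False, plain=False):
    """Hypothetical helper: assemble keyword arguments for htmlCleaner."""
    # Precaution added for this sketch only; the documentation above does not
    # say whether htmlCleaner creates a missing output directory.
    if not os.path.isdir(output_directory):
        raise ValueError("output directory does not exist: %s" % output_directory)

    # A standard-library logger stands in for the `log` argument (assumption).
    log = logging.getLogger("polyglot-example")

    return {
        "log": log,
        "settings": settings,
        "url": url,
        "outputDirectory": output_directory,
        "title": title,
        # plain=True turns off the three presentation options together
        "style": not plain,
        "metadata": not plain,
        "h1": not plain,
    }
```

With polyglot installed and a suitable `settings` dictionary, the result could then be used as `htmlCleaner(**make_cleaner_kwargs(url, "/tmp", settings)).clean()`.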

__init__(log, settings, url, outputDirectory=False, title=False, style=True, metadata=True, h1=True)

Methods

`__init__`(log, settings, url[, ...])
`clean`() – parse and clean the HTML document with Mercury Parser