polyglot.htmlCleaner (class)
class polyglot.htmlCleaner(log, settings, url, outputDirectory=False, title=False, style=True, metadata=True, h1=True)

A parser/cleaner to strip a webpage article of all cruft and neatly present it with some nice CSS.
Key Arguments:
- log – logger
- settings – the settings dictionary
- url – the URL to the HTML page to parse and clean
- outputDirectory – path to the directory to save the output HTML file to
- title – title of the document to save. If False, the title of the HTML page is used as the filename. Default False.
- style – add polyglot’s styling to the HTML document. Default True.
- metadata – include metadata in the generated HTML. Default True.
- h1 – include the title as an H1 at the top of the document. Default True.
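The `title` fallback described above (use the page title as the filename when `title` is False) can be illustrated with a small standard-library sketch. This is not polyglot's implementation: `TitleParser` and `filename_from_title` are hypothetical names, and the rule for replacing awkward filename characters is an assumption made for the example.

```python
from html.parser import HTMLParser
import re


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def filename_from_title(html, title=False):
    """Return an output filename, falling back to the page <title> when title is False."""
    if not title:
        parser = TitleParser()
        parser.feed(html)
        title = parser.title.strip() or "untitled"
    # Assumed sanitising rule: collapse anything that is not a word
    # character, dot or hyphen into a single underscore.
    safe = re.sub(r"[^\w.-]+", "_", title).strip("_")
    return safe + ".html"
```

Passing an explicit `title` bypasses the page lookup, which mirrors the documented behaviour of the `title` argument.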
Usage:
To generate the HTML page, using the title of the webpage as the filename:
    from polyglot import htmlCleaner
    cleaner = htmlCleaner(
        log=log,
        settings=settings,
        url="http://www.thespacedoctor.co.uk/blog/2016/09/26/mysqlSucker-index.html",
        outputDirectory="/tmp"
    )
    cleaner.clean()
Or specify the title of the document and remove styling, metadata and title:
    from polyglot import htmlCleaner
    cleaner = htmlCleaner(
        log=log,
        settings=settings,
        url="http://www.thespacedoctor.co.uk/blog/2016/09/26/mysqlSucker-index.html",
        outputDirectory="/tmp",
        title="my_clean_doc",
        style=False,
        metadata=False,
        h1=False
    )
    cleaner.clean()
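The usage examples assume that `log` and `settings` already exist. A minimal sketch of how they might be prepared follows. The helper names `make_cleaner_args` and `run_clean` are hypothetical, the logger comes from Python's standard `logging` module, and the empty `settings` dictionary is only a placeholder: the keys polyglot expects are not listed in this section, so a real run may need more.

```python
import logging


def make_cleaner_args(url, output_directory="/tmp"):
    """Build the keyword arguments used in the usage examples (hypothetical helper)."""
    # The docs only say `log` is a logger; a stdlib logger is assumed here.
    log = logging.getLogger("htmlCleaner-example")
    # Placeholder: the required settings keys are not documented in this section.
    settings = {}
    return dict(
        log=log,
        settings=settings,
        url=url,
        outputDirectory=output_directory,
    )


def run_clean(args):
    """Instantiate htmlCleaner with the prepared arguments and run clean()."""
    # Imported lazily so the sketch loads even where polyglot is not installed.
    from polyglot import htmlCleaner
    htmlCleaner(**args).clean()
```

Calling `run_clean(make_cleaner_args(url))` would then correspond to the first usage example above, provided polyglot is installed and the settings are sufficient.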
__init__(log, settings, url, outputDirectory=False, title=False, style=True, metadata=True, h1=True)
Methods
__init__(log, settings, url[, ...])
clean() – parse and clean the HTML document with Mercury Parser