Parse HTML Components
This page explains how to parse and extract information from a page (local or remote).
Parsing HTML and extracting the relevant information is useful in many contexts: scanning a page for a price change, extracting a component, detecting broken links, and so on.
AppSeed, in particular, uses HTML parsing for two things:
Page structure detection
Component extraction
For newcomers, AppSeed uses automation tools to convert lifeless UI Kits into simple starters generated in many frameworks and patterns. For instance, this open-source design - provided by Themesberg has been translated to and using HTML parsing as the first phase of the translation process.
Required libraries and tools
- Python - interpreter
- Beautiful Soup (BS4) - a well-known parsing library
- Lxml - used to compensate for BS4 limitations
The flow explained in this article will execute a few simple steps:
Load the HTML content - this can be done from a local file or using a LIVE website
Analyze the page and extract the XPATH expression for a component
Use the Lxml library to extract the HTML
Format the component and save it on disk
Install libraries via PIP
From this point on, all the code is typed in a Python console.
Load the content from local file
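A minimal sketch of this step, assuming a small sample page. The file name `index.html`, the temporary folder and the sample markup are hypothetical and only exist to make the snippet runnable as-is; in practice the path points to the page you want to parse.

```python
from pathlib import Path
import tempfile

# Hypothetical sample page, written to a temporary folder for illustration only
sample_html = """<!DOCTYPE html>
<html>
  <body>
    <div class="container">
      <div class="header">Header</div>
      <div class="card"><h3>Sample card</h3><p>Card text</p></div>
    </div>
  </body>
</html>
"""
page_path = Path(tempfile.mkdtemp()) / "index.html"
page_path.write_text(sample_html, encoding="utf-8")

# Load the whole file into a single string
html_page = page_path.read_text(encoding="utf-8")
print(type(html_page), len(html_page))
```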
At this point, the html_page variable contains the entire HTML content (string type) and we can use it in BS4 or Lxml to extract the components. To visualize the page structure we can use browser tools:
The target component will be extracted using an XPATH expression provided by the browser:
To extract the component, this XPATH expression will be used in the Lxml library to isolate the code.
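One way this could look, as a sketch: the sample markup and the XPATH value below are hypothetical stand-ins for the real page and for the expression copied from the browser. The `xpath()` call returns a list of matching elements, so the first match is taken as the component.

```python
from lxml import html

# Hypothetical page content; in the flow above this is the html_page string
html_page = """<!DOCTYPE html>
<html>
  <body>
    <div class="container">
      <div class="header">Header</div>
      <div class="card"><h3>Sample card</h3><p>Card text</p></div>
    </div>
  </body>
</html>
"""

# Parse the string into an element tree
tree = html.fromstring(html_page)

# Hypothetical XPATH, standing in for the one copied from the browser tools
xpath_expr = "/html/body/div/div[2]"

# xpath() returns a list; an empty list means the expression matched nothing
matches = tree.xpath(xpath_expr)
if not matches:
    raise ValueError("No element matches " + xpath_expr)

component = matches[0]
print(component.tag, component.get("class"))
```

Checking for an empty result before indexing gives a clearer error when the copied expression does not match the parsed document, which can happen when the browser shows a DOM that scripts have modified.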
To extract the raw HTML from the component object we need to use the tostring helper exposed by the Lxml library:
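A short sketch of the call, assuming a hypothetical component built from an inline fragment so the snippet runs on its own. By default `tostring` returns bytes; passing `encoding="unicode"` returns a regular string instead.

```python
from lxml import html

# Hypothetical component; in the flow above it comes from the XPATH lookup
component = html.fromstring(
    '<div class="card"><h3>Sample card</h3><p>Card text</p></div>'
)

# Default call: the serialized HTML as bytes
raw_bytes = html.tostring(component)

# With encoding="unicode": the serialized HTML as a str
raw_html = html.tostring(component, encoding="unicode")

print(type(raw_bytes), type(raw_html))
print(raw_html)
```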
The next step is to call Beautiful Soup and prettify the HTML before saving it on disk:
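A possible version of this step is sketched below. The raw markup, the output file name `component.html` and the temporary folder are hypothetical. The built-in `html.parser` backend is an assumption made here so the fragment is not wrapped in extra `html` and `body` tags; other parsers may behave differently.

```python
from pathlib import Path
import tempfile

from bs4 import BeautifulSoup

# Hypothetical raw HTML; in the flow above it comes from the tostring helper
raw_html = '<div class="card"><h3>Sample card</h3><p>Card text</p></div>'

# Parse the fragment and produce an indented, one-tag-per-line version
soup = BeautifulSoup(raw_html, "html.parser")
pretty_html = soup.prettify()

# Save the formatted component on disk (hypothetical output location)
out_path = Path(tempfile.mkdtemp()) / "component.html"
out_path.write_text(pretty_html, encoding="utf-8")

print(pretty_html)
```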
The component is fully extracted and parsable:
The rendered version:
Load content from a remote HTML file
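As a sketch only, the remote page can be downloaded with the standard library. The helper name `load_remote_html`, the User-Agent value and the timeout are assumptions for illustration, not part of the original flow; the call at the bottom is left commented out because it needs network access and a real URL.

```python
from urllib.request import Request, urlopen


def load_remote_html(url, timeout=10):
    """Download a page and return its content as a string (hypothetical helper)."""
    # Some servers reject requests without a User-Agent header (assumed value)
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset announced by the server, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Usage (requires network access; the URL is a placeholder):
# html_page = load_remote_html("https://example.com")
```

Once the content is loaded into a string, the extraction steps are the same as for a local file.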
- related article published on StackOverflow
- the right way (with sample)
- StackOverflow article