# Parse HTML Components

Parsing HTML to extract relevant information is useful in many contexts: scanning a page for a price change, extracting a component, detecting broken links, and so on.

AppSeed, in particular, uses HTML parsing for two things:

* Page structure detection
* Component extraction

For newcomers, **AppSeed** uses automation tools to convert lifeless UI Kits into simple starters generated for many frameworks and patterns. For instance, the open-source design [**Pixel Lite**](/docs/content/bootstrap-template/pixel-lite-template.md), provided by Themesberg, has been *translated* to [Flask](/docs/products/flask-apps/pixel-lite.md) and [Django](/docs/products/django-apps/pixel-lite.md) using **HTML parsing** as the first phase of the translation process.

> Required libraries and tools

* [Python](https://www.python.org/) - interpreter
* [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) - a well-known parsing library
* [Lxml](https://lxml.de/) - used to compensate for BS4 limitations

### The process

The flow explained in this article will execute a few simple steps:

* Load the HTML content - this can be done from a local file or from a live website
* Analyze the page and extract the XPATH expression for a component
* Use the Lxml library to extract the HTML
* Format the component and save it to disk
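The steps above can be sketched as a single helper. This is a minimal sketch, not AppSeed's actual implementation: the function name `extract_component` and the inline `sample_page` are hypothetical, and `method="html"` plus the explicit `"html.parser"` argument are assumptions chosen to keep the output predictable.

```python
from bs4 import BeautifulSoup
from lxml import html
from lxml.etree import tostring


def extract_component(html_page, xpath_expr):
    """Return the prettified HTML of the first node matching xpath_expr."""
    dom = html.fromstring(html_page)      # parse the page into an lxml tree
    matches = dom.xpath(xpath_expr)       # xpath() returns a list of nodes
    if not matches:
        raise ValueError(f"No element matches {xpath_expr!r}")
    # Serialize only the matched node, as text, using HTML serialization
    raw = tostring(matches[0], encoding="unicode", method="html", with_tail=False)
    return BeautifulSoup(raw, "html.parser").prettify()


# Hypothetical inline page, used here instead of a real file or URL
sample_page = (
    '<html><body><section id="features">'
    "<p>80 components</p>"
    "</section></body></html>"
)
component_html = extract_component(sample_page, '//*[@id="features"]')
print(component_html)
```

Raising an error on an empty match is one possible design choice; returning `None` would work equally well if a missing component is an expected case.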

> Install libraries via PIP

```
$ pip install requests 
$ pip install lxml
$ pip install beautifulsoup4
```

From this point, all the code is typed into a Python console:

```python
$ python [ENTER]
>>>
```

> Load the content from a local file

```python
>>> # Read the whole file; the `with` block closes it automatically
>>> with open('./app/templates/index.html', 'r') as f:
...     html_page = f.read()
...
```

> Load the content from a remote HTML file (the [LIVE sample](https://demo.themesberg.com/pixel-lite/index.html))

```python
>>> import requests
>>> page = requests.get('https://demo.themesberg.com/pixel-lite/index.html')
>>> html_page = page.content
```
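Both loading options can be wrapped in one small helper. The sketch below is illustrative only: `load_html` is a hypothetical name, the 10-second timeout is an arbitrary assumption, and the helper returns `response.text` (a decoded string) rather than `page.content` so that both branches yield the same type.

```python
import os
import tempfile

import requests


def load_html(source, timeout=10):
    """Return page content as text from a URL or a local file path."""
    if source.startswith(("http://", "https://")):
        response = requests.get(source, timeout=timeout)
        response.raise_for_status()   # fail early on 4xx/5xx responses
        return response.text          # decoded str rather than raw bytes
    with open(source, "r", encoding="utf-8") as f:
        return f.read()


# Usage with a throwaway local file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "index.html")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write('<section id="features"></section>')
    html_page = load_html(sample_path)

print(html_page)
```

Passing the live sample URL instead of `sample_path` would exercise the remote branch.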

At this point, the `html_page` variable contains the entire HTML content (a `str` when read from the local file, `bytes` when taken from `page.content`), and either form can be passed to BS4 or Lxml to extract the components. To visualize the page structure, we can use the browser's developer tools:

![HTML Parser - Target Component Inspection.](/files/-Ma7XgZQnvLb5_J5xi9z)

The target component will be extracted using an `XPATH` expression provided by the browser:

```markup
//*[@id="features"]
```

To extract the component, this `XPATH` expression is passed to the **Lxml** library to isolate the code.

```python
>>> from lxml import html
>>> html_dom = html.fromstring( html_page )
>>> # xpath() returns a list of matching elements
>>> component = html_dom.xpath( '//*[@id="features"]' )
```
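Because `xpath()` returns a list, it is worth checking that something matched before indexing into it. A minimal sketch, assuming a hypothetical two-section page defined inline:

```python
from lxml import html

# Hypothetical page with two sections, defined inline for illustration
sample_page = """
<html><body>
  <section id="features"><h2>80 components</h2></section>
  <section id="pricing"><h2>Plans</h2></section>
</body></html>
"""

html_dom = html.fromstring(sample_page)

matches = html_dom.xpath('//*[@id="features"]')
missing = html_dom.xpath('//*[@id="does-not-exist"]')

print(len(matches))   # number of matching elements
print(missing)        # an empty list, not None and not an exception

if matches:
    component = matches[0]
    print(component.tag, component.get("id"))
```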

To extract the raw HTML from the `component` object, we need to use the `tostring` helper exposed by the Lxml library (by default it returns a `bytes` object):

```python
>>> from lxml.etree import tostring
>>> component_html = tostring( component[0] )
```

The next step is to call Beautiful Soup and prettify the HTML before saving it to disk:

```python
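>>> from bs4 import BeautifulSoup as bs
>>> # Name the parser explicitly to avoid the "no parser specified" warning
>>> soup = bs( component_html, 'html.parser' )
>>> print( soup.prettify() )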
```
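Saving the result is the final step of the flow. A minimal sketch, assuming a hypothetical output file named `features.html` in the system temporary directory and a small inline component in place of the real one:

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical component, standing in for the HTML extracted with Lxml
component_html = '<section id="features"><p>80 components</p></section>'

soup = BeautifulSoup(component_html, "html.parser")
pretty_html = soup.prettify()   # prettify() returns a str

# Hypothetical destination; any writable path works
output_path = os.path.join(tempfile.gettempdir(), "features.html")
with open(output_path, "w", encoding="utf-8") as f:
    f.write(pretty_html)

print(output_path)
```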

The component is fully extracted and parsable:

```markup
  <section class="section section-lg pb-0" id="features">
   <div class="container">
    <div class="row">
     
     ...
     
     <div class="col-12 col-md-4">
      <div class="icon-box text-center mb-5 mb-md-0">
       <div class="icon icon-shape icon-lg bg-white shadow-lg border-light rounded-circle icon-secondary mb-3">
        <span class="fas fa-box-open">
        </span>
       </div>
       <h2 class="my-3 h5">
        80 components
       </h2>
       <p class="px-lg-4">
        Beatifully crafted and creative components made with great care for each pixel
       </p>
      </div>
     </div>
     
     ...
     
     </div>
    </div>
   </div>
  </section>
```

> The rendered version:

![HTML Parser - Extracted Component.](/files/-Ma7Xt41ii0SGASStnBS)

### Resources

* [Use XPath in Beautiful Soup](https://stackoverflow.com/questions/11465555/can-we-use-xpath-with-beautifulsoup) - related article published on StackOverflow
* [Web Scraping](https://docs.python-guide.org/scenarios/scrape/) - the right way (with sample)
* [How to get the content from Lxml object](https://stackoverflow.com/questions/5395948/incredibly-basic-lxml-questions-getting-html-string-content-of-lxml-etree-elem) - StackOverflow article


