Daily Bulletin

The Conversation

  • Written by Josh Bassett, Data Platform Technical Lead, The Conversation

Here at The Conversation we run a Job Board that requires parsing a whole bunch of job descriptions in HTML and converting them to Markdown. When we originally built the Job Board we looked around for an HTML-to-Markdown converter library written in Ruby, but unfortunately we couldn’t find one, so we built our own: Upmark.

Upmark allows you to easily convert HTML documents to Markdown format:

require "upmark"
html = "<p>messenger <strong>bag</strong> skateboard</p>"
markdown = Upmark.convert(html)
puts markdown
# prints: messenger **bag** skateboard

It can handle most HTML tags, and anything that can’t be converted to Markdown is passed through as HTML.

How does it work?

Upmark does all the heavy lifting with a parser and a transformer built on the excellent Parslet library. Parslet lets you define a grammar in plain Ruby, which is then used to parse a document into a syntax tree. The syntax tree can then be arbitrarily transformed; in our case it is transformed into a Markdown document.
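
To make the two phases concrete, here is a minimal sketch of the same parse-then-transform shape in plain Ruby. It does not use Parslet and is not how Upmark is implemented: the parse and transform helpers are hypothetical, and the sketch assumes the input contains only plain text and <strong> elements.

```ruby
require "strscan"

# Phase 1: parse a flat HTML fragment into a tiny syntax tree (an array of hashes).
# Assumption: only plain text and <strong>...</strong> appear in the input.
def parse(html)
  scanner = StringScanner.new(html)
  tree = []
  until scanner.eos?
    if scanner.scan(/<strong>(.*?)<\/strong>/m)
      tree << { strong: scanner[1] }
    elsif scanner.scan(/[^<]+/)
      tree << { text: scanner.matched }
    else
      raise ArgumentError, "unsupported input at position #{scanner.pos}"
    end
  end
  tree
end

# Phase 2: transform each node of the syntax tree into Markdown.
def transform(tree)
  tree.map do |node|
    node.key?(:strong) ? "**#{node[:strong]}**" : node[:text]
  end.join
end

tree = parse("messenger <strong>bag</strong> skateboard")
markdown = transform(tree)
puts markdown
```

Keeping the tree as plain data between the two phases is what lets the parsing rules and the output format change independently of each other.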

The whole process looks something like this:

[Image: Author provided]

Parse it!

The first phase of the process is parsing the input into a syntax tree. To parse an HTML document we first need to define a grammar. A grammar contains the individual rules for parsing the different parts of a document. Rules for parsing simpler elements can be combined to parse more complex structures.

Parslet provides us with the Parslet::Parser class, which we extend to define our parser:

class MyParser < Parslet::Parser
  # all the rules go here
end

In the case of Upmark, we first define rules for parsing the more complex parts of an HTML document, like an element. The rule for parsing an element is then decomposed into rules for parsing tags and attributes. These rules are then further broken down into combinations of simpler rules for text, numbers, and whitespace.

Consider the following snippet of HTML:

<p>hello world!</p>
<img src="lol.gif" />
<ol>
  <li>one</li>
  <li>two</li>
  <li>three</li>
</ol>

This document is just a series of HTML elements, so the first rule we define might be:

rule(:element) do
  start_tag.as(:start_tag) >> # e.g. "<p>"
  children.as(:children) >>
  end_tag.as(:end_tag) # e.g. "</p>"
end

This rule says that in order to parse an element we need a start_tag, some children, and finally an end_tag. The as modifiers define how they are labelled in the resulting syntax tree.

Okay, so now what? Let’s break it down further and add the next rule to our parser. To parse a start_tag we need a < character, a name, zero or more attributes (separated by whitespace), some optional whitespace, and finally a > character.

rule(:start_tag) do
  str('<') >>
  name.as(:name) >>
  (space >> attribute).repeat.as(:attributes) >>
  space? >>
  str('>')
end

According to the XML spec, a name is just a string limited to a particular range of characters:

rule(:name) do
  match(/[a-zA-Z_:]/) >> match(/[\w:\.-]/).repeat
end

Here are the rules for parsing whitespace:

rule(:space) { match(/\s/).repeat(1) }
rule(:space?) { space.maybe }

I’ll leave defining the rules for parsing children and attributes as an exercise for the reader (or you can cheat and just look in the Upmark source code).

Finally, this is how we apply our parser to the input:

tree = MyParser.new.parse(html)

Once our parser is applied to a document, a syntax tree is generated.

Transform it!

The second phase of the whole process is to transform the syntax tree into some desired output. Parslet syntax trees are represented as an array of nested hashes. For example:

tree = [
  {
    element: {
      name: "img",
      attributes: [{name: "src", value: "http://example.com/lol.gif"}],
      children: []
    }
  }
]

Given the above syntax tree, let’s write a transform which traverses the syntax tree and converts it to Markdown. Again, Parslet makes transforming easier for us by providing the Parslet::Transform class to extend:

class MyTransform < Parslet::Transform
  rule(
    element: {
      name: "img",
      attributes: subtree(:attributes),
      children: subtree(:children)
    }
  ) do |img|
    # The tree uses symbol keys, so look attributes up by symbol.
    src = img[:attributes].find { |attribute| attribute[:name] == "src" }[:value]
    "![](#{src})"
  end
end

The MyTransform transform matches an img element with a subtree of attributes and a subtree of children. A hash pattern has to account for every key in the hash it matches, which is why children is listed even though an img has none. The rule then plucks out the src attribute and returns the Markdown for an image.

This is how we apply the transform to the syntax tree:

markdown = MyTransform.new.apply(tree)
puts markdown
# prints: ![](http://example.com/lol.gif)

Turtles all the way down

So how did we write a parser that converts an entire HTML document to Markdown? The answer is simple: it’s turtles all the way down. By combining multiple rules and transforms, we can break a big problem down into a series of smaller problems.

Hopefully this gives you some insight into how to write your own parser using Parslet, and if you happen to need a handy HTML-to-Markdown converter then please check out Upmark.
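
To illustrate how small rules combine, here is a hand-rolled sketch in plain Ruby that walks a syntax tree shaped like the one above and renders Markdown. It is not Parslet::Transform and not Upmark's code: the RULES table and render method are hypothetical names, text nodes are assumed to be plain strings, and unknown elements simply keep their rendered children (Upmark itself passes them through as HTML).

```ruby
# Each rule maps an element name to a lambda that renders Markdown.
# The lambda receives the element hash and its already-rendered children.
RULES = {
  "p"      => ->(_element, inner) { "#{inner}\n\n" },
  "strong" => ->(_element, inner) { "**#{inner}**" },
  "img"    => ->(element, _inner) {
    src = element[:attributes].find { |attribute| attribute[:name] == "src" }
    "![](#{src[:value]})"
  }
}.freeze

# Recursively render a node: strings are text leaves, arrays are lists of
# nodes, and hashes are elements whose children are rendered first.
def render(node)
  return node if node.is_a?(String)
  return node.map { |child| render(child) }.join if node.is_a?(Array)

  element = node.fetch(:element)
  inner = render(element[:children])
  rule = RULES[element[:name]]
  # Assumption for this sketch: elements without a rule keep only their children.
  rule ? rule.call(element, inner) : inner
end

tree = [
  { element: { name: "p", attributes: [], children: [
    "hello ",
    { element: { name: "strong", attributes: [], children: ["world"] } }
  ] } },
  { element: { name: "img",
               attributes: [{ name: "src", value: "http://example.com/lol.gif" }],
               children: [] } }
]

markdown = render(tree)
puts markdown
```

Because children are rendered before their parent's rule runs, each rule only has to deal with one level of the tree, which is the "turtles all the way down" idea in miniature.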

Authors: Josh Bassett, Data Platform Technical Lead, The Conversation

Read more http://theconversation.com/converting-html-to-markdown-with-upmark-65788
