Parser

2026 · github

Parser is a tiny Go library that fetches a web page and extracts its title, meta description, links, and body text. It also tokenises the text into words.

It normalises URLs, strips scripts and styles, filters out non-web links (mailto:, tel:, and similar), and keeps only words longer than 2 characters when tokenising.
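
As a rough illustration of the tokenising rule, here is a minimal sketch using only the Go standard library. The names tokenise and minTokenLen are hypothetical and not taken from the library. Lowercasing and splitting on anything that is not a letter or digit are assumptions; the only rule stated above is that words must be longer than 2 characters.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// minTokenLen is a hypothetical constant: "longer than 2 characters"
// means a word needs at least 3 characters to be kept.
const minTokenLen = 3

// tokenise splits text into words and drops the short ones.
// Assumption: words are lowercased and split on any rune that is
// neither a letter nor a digit.
func tokenise(text string) []string {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	tokens := make([]string, 0, len(fields))
	for _, f := range fields {
		// Count runes, not bytes, so non-ASCII words are measured correctly.
		if utf8.RuneCountInString(f) >= minTokenLen {
			tokens = append(tokens, f)
		}
	}
	return tokens
}

func main() {
	fmt.Println(tokenise("Go is a tiny, fast language!"))
	// prints: [tiny fast language]
}
```

Counting runes rather than bytes keeps the length filter meaningful for text outside ASCII.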

Technical Details

  • Go standard library + goquery for HTML parsing/selection
  • Returns a single struct: title, description, links, text, tokens
  • URL normalisation (resolve relative, strip fragments)
  • HTML sanitisation (strip <script>, <style>)
  • Word tokenisation with minimum length filter
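
The URL normalisation and link filtering steps could look something like the following sketch, which uses only net/url from the standard library. The function name normaliseLink is hypothetical, and treating only http and https links as worth keeping is an assumption based on the mailto/tel filtering described above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normaliseLink resolves href against the page URL, strips the fragment,
// and reports whether the link should be kept.
// Assumption: only http and https links are kept; mailto:, tel:,
// javascript: and similar schemes are filtered out.
func normaliseLink(pageURL, href string) (string, bool) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return "", false
	}
	ref, err := url.Parse(strings.TrimSpace(href))
	if err != nil {
		return "", false
	}
	abs := base.ResolveReference(ref) // resolve relative links
	if abs.Scheme != "http" && abs.Scheme != "https" {
		return "", false
	}
	// Strip the fragment so "#section" variants collapse to one URL.
	abs.Fragment = ""
	abs.RawFragment = ""
	return abs.String(), true
}

func main() {
	page := "https://example.com/blog/post"
	for _, href := range []string{"/docs/intro#setup", "../about", "mailto:hi@example.com"} {
		if link, ok := normaliseLink(page, href); ok {
			fmt.Println(link)
		}
	}
}
```

Resolving against the page URL first means relative, absolute, and fragment-only links all pass through the same scheme check and fragment stripping.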