readability

package module
v0.3.1 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 13, 2025 License: Apache-2.0 Imports: 9 Imported by: 3

README

go-readability

A Go implementation of Mozilla's Readability library, inspired by @mizchi/readability. This library extracts the main content from web pages, removing clutter like navigation, ads, and unnecessary elements to provide a clean reading experience.

Installation

go get github.com/mackee/go-readability

Usage

As a Library

package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/mackee/go-readability"
)

func main() {
	// Fetch a web page
	resp, err := http.Get("https://example.com/article")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Parse and extract the main content
	article, err := readability.FromReader(resp.Body, "https://example.com/article")
	if err != nil {
		log.Fatal(err)
	}

	// Access the extracted content
	fmt.Println("Title:", article.Title)
	fmt.Println("Byline:", article.Byline)
	fmt.Println("Content:", article.Content)
	
	// Get content as HTML
	html := article.Content
	fmt.Println("HTML:", html)

	// Get content as plain text
	text := article.TextContent
	fmt.Println("Text:", text)

	// Get metadata
	fmt.Println("Excerpt:", article.Excerpt)
	fmt.Println("SiteName:", article.SiteName)
}

Using the CLI Tool

The package includes a command-line tool that can extract content from a URL:

# Install the CLI tool
go install github.com/mackee/go-readability/cmd/readability@latest

# Extract content from a URL
readability https://example.com/article

# Save the extracted content to a file
readability https://example.com/article > article.html

# Output as markdown
readability --format markdown https://example.com/article > article.md

# Output metadata as JSON
readability --metadata https://example.com/article

Features

  • Extracts the main content from web pages
  • Removes clutter like navigation, ads, and unnecessary elements
  • Preserves important images and formatting
  • Extracts metadata (title, byline, excerpt, etc.)
  • Supports output in HTML or Markdown format
  • Command-line interface for easy content extraction

Testing

This library uses test fixtures based on Mozilla's Readability test suite. Only a subset of the test cases is implemented so far; the source HTML files are identical to those in the original Mozilla implementation.

Test Fixtures

The test fixtures in testdata/fixtures/ are sourced from Mozilla's Readability test suite, with some differences:

  • The source HTML files (source.html) are identical to Mozilla's Readability
  • The expected output HTML (expected.html) may differ due to implementation differences between JavaScript and Go
  • The expected metadata extraction results are aligned with Mozilla's implementation where possible

While not all test cases from Mozilla's Readability are currently implemented, using the same source HTML helps ensure that:

  1. The Go implementation handles the same input as the JavaScript implementation
  2. Regressions can be easily detected
  3. Users can trust the library to process the same types of content as Mozilla's Readability

Fixture Licensing

The source HTML fixtures are identical to those used in Mozilla's Readability implementation.

License

Apache License 2.0

Documentation

Overview

Package readability provides functionality to extract readable content from HTML documents. It implements an algorithm similar to Mozilla's Readability.js to identify and extract the main content from web pages, removing clutter, navigation, ads, and other non-content elements.



Constants

This section is empty.

Variables

This section is empty.

Functions

func AddSignificantElementsByClassOrId

func AddSignificantElementsByClassOrId(body *dom.VElement, potentialNodes *[]*dom.VElement)

AddSignificantElementsByClassOrId detects elements with meaningful class names or IDs and adds them to the potentialNodes slice. This helps identify content containers that might not use semantic HTML tags but follow common naming conventions.

Parameters:

  • body: The body element to search within
  • potentialNodes: A pointer to a slice where identified elements will be added

func AnalyzeUrlPattern

func AnalyzeUrlPattern(url string) string

AnalyzeUrlPattern analyzes the pattern of the URL's last part. This is a helper function for debugging and understanding URL patterns. It categorizes the last part of a URL into patterns like "numeric only", "alphanumeric", etc.

Parameters:

  • url: The URL to analyze

Returns:

  • A string describing the pattern of the URL's last part
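
As an illustration only, this kind of categorization can be sketched with the standard library. The helper name `lastSegmentPattern`, its regular expressions, and the labels "empty", "other", and "invalid URL" are assumptions made for this sketch; the package's actual categories and output strings may differ.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
	"strings"
)

var (
	numericOnly  = regexp.MustCompile(`^[0-9]+$`)
	alphanumeric = regexp.MustCompile(`^[0-9A-Za-z]+$`)
)

// lastSegmentPattern is a hypothetical helper (not part of the package).
// It labels the last path segment of a URL with a coarse pattern name.
func lastSegmentPattern(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "invalid URL"
	}
	// Drop trailing slashes, then keep whatever follows the final slash.
	p := strings.TrimRight(u.Path, "/")
	last := p[strings.LastIndex(p, "/")+1:]
	switch {
	case last == "":
		return "empty"
	case numericOnly.MatchString(last):
		return "numeric only"
	case alphanumeric.MatchString(last):
		return "alphanumeric"
	default:
		return "other"
	}
}

func main() {
	for _, u := range []string{
		"https://example.com/posts/12345",
		"https://example.com/a/abc123/",
		"https://example.com/a/my-post",
		"https://example.com/",
	} {
		fmt.Printf("%s -> %s\n", u, lastSegmentPattern(u))
	}
}
```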

func AriaTreeToString

func AriaTreeToString(tree *AriaTree) string

AriaTreeToString converts an AriaTree to a string representation. This is useful for debugging and visualizing the accessibility structure of a document.

Parameters:

  • tree: The AriaTree to convert to a string

Returns:

  • A string representation of the tree

func CountAriaNodes

func CountAriaNodes(node *AriaNode) int

CountAriaNodes counts the total number of nodes in an AriaNode tree. This includes the node itself and all its descendants.

Parameters:

  • node: The root node to count from

Returns:

  • The total number of nodes in the tree
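
The counting rule described above (the node itself plus all descendants) amounts to a short recursion. The sketch below uses a hypothetical `node` type that stands in for AriaNode and keeps only a Children slice; treating a nil node as zero is an assumption of this sketch.

```go
package main

import "fmt"

// node is a hypothetical stand-in for AriaNode; only Children matters here.
type node struct {
	Name     string
	Children []*node
}

// countNodes counts n itself plus all of its descendants.
// Returning 0 for a nil node is an assumption of this sketch.
func countNodes(n *node) int {
	if n == nil {
		return 0
	}
	total := 1
	for _, c := range n.Children {
		total += countNodes(c)
	}
	return total
}

func main() {
	tree := &node{Name: "main", Children: []*node{
		{Name: "heading"},
		{Name: "list", Children: []*node{{Name: "listitem"}, {Name: "listitem"}}},
	}}
	fmt.Println("nodes:", countNodes(tree))
}
```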

func CountNodes

func CountNodes(element *dom.VElement) int

CountNodes counts the number of nodes within a VElement. This includes the element itself and all its descendants (both elements and text nodes).

Parameters:

  • element: The element to count nodes for

Returns:

  • The total number of nodes

func CreateElement

func CreateElement(tagName string) *dom.VElement

CreateElement creates a new element with the given tag name. This is useful for creating new elements to insert into the DOM.

Parameters:

  • tagName: The tag name for the new element

Returns:

  • A new VElement with the specified tag name

func CreateExtractor

func CreateExtractor(options ReadabilityOptions) func(string) (ReadabilityArticle, error)

CreateExtractor creates a custom extractor function with specific options. This is useful when you want to reuse the same extraction configuration multiple times. The returned function can be called with HTML strings to extract content using the predefined options.

Parameters:

  • options: The readability options to use for all extractions

Returns:

  • A function that takes an HTML string and returns a ReadabilityArticle and error

func CreateTextNode

func CreateTextNode(content string) *dom.VText

CreateTextNode creates a new text node with the given content. This is useful for creating text nodes to insert into the DOM.

Parameters:

  • content: The text content for the new node

Returns:

  • A new VText node with the specified content

func ExtractTextContent

func ExtractTextContent(element *dom.VElement) string

ExtractTextContent extracts the text content of a VElement. It returns only the content of text nodes, without any HTML formatting.

Parameters:

  • element: The element to extract text from

Returns:

  • A string containing all text content from the element and its descendants

func FindMainCandidates

func FindMainCandidates(doc *dom.VDocument, nbTopCandidates int) []*dom.VElement

FindMainCandidates detects nodes that are likely to be the main content candidates, sorted by score. It implements the core scoring algorithm of readability, analyzing elements based on content length, tag types, class names, and other heuristics to identify the most likely content containers.

Parameters:

  • doc: The parsed HTML document
  • nbTopCandidates: The number of top candidates to return

Returns:

  • A slice of the top N candidate elements, sorted by score in descending order
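
The final selection step (sort by score in descending order, keep the top N) can be sketched as follows. This does not reproduce the scoring heuristics themselves; the `candidate` type, its fields, and the scores are hypothetical values chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical pairing of an element label with its score.
type candidate struct {
	Tag   string
	Score float64
}

// topCandidates returns up to n candidates, highest score first.
// The input slice is copied so the caller's order is left untouched.
func topCandidates(cands []candidate, n int) []candidate {
	sorted := make([]candidate, len(cands))
	copy(sorted, cands)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Score > sorted[j].Score
	})
	if n < 0 {
		n = 0
	}
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}

func main() {
	cands := []candidate{{"div", 10}, {"article", 42}, {"section", 7}}
	for _, c := range topCandidates(cands, 2) {
		fmt.Printf("%s %.0f\n", c.Tag, c.Score)
	}
}
```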

func FindStructuralElements

func FindStructuralElements(doc *dom.VDocument) (
	header *dom.VElement,
	footer *dom.VElement,
	otherSignificantNodes []*dom.VElement,
)

FindStructuralElements detects header, footer, and other significant structural elements in a document. This is particularly useful for pages that are classified as articles but where the main content extraction fails to meet the threshold. It uses semantic tags, ARIA roles, and common class/ID patterns to identify important page structures.

Parameters:

  • doc: The parsed HTML document

Returns:

  • header: The identified page header element, if found
  • footer: The identified page footer element, if found
  • otherSignificantNodes: Other semantically significant elements found in the document

func FormatDocument

func FormatDocument(text string) string

FormatDocument formats the entire document. It merges consecutive line breaks into one and removes extra line breaks at the beginning and end, producing cleaner, more readable text output.

Parameters:

  • text: The text to format

Returns:

  • The formatted text
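
A minimal sketch of the behavior described above, written with the standard library. The helper name `collapseLineBreaks` is hypothetical, and normalizing "\r\n" to "\n" first is an assumption; the package may treat whitespace differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var multipleNewlines = regexp.MustCompile(`\n{2,}`)

// collapseLineBreaks is a hypothetical helper: it merges runs of line breaks
// into a single one and strips line breaks at the start and end of the text.
func collapseLineBreaks(text string) string {
	// Assumption: treat Windows line endings like plain "\n".
	text = strings.ReplaceAll(text, "\r\n", "\n")
	text = multipleNewlines.ReplaceAllString(text, "\n")
	return strings.Trim(text, "\n")
}

func main() {
	fmt.Printf("%q\n", collapseLineBreaks("\n\nTitle\n\n\nBody text\n"))
}
```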

func GetAccessibleName

func GetAccessibleName(element *dom.VElement) string

GetAccessibleName returns the accessible name of an element. It follows the accessible name calculation algorithm, prioritizing aria-label, aria-labelledby, alt, title, and text content. The accessible name is what would be announced by screen readers and other assistive technologies.

Parameters:

  • element: The element to get the accessible name for

Returns:

  • The accessible name as a string

func GetAriaRole

func GetAriaRole(element *dom.VElement) string

GetAriaRole returns the ARIA role of an element. It returns the explicit role attribute or an implicit role based on the tag name. ARIA roles provide semantic meaning to elements for accessibility purposes.

Parameters:

  • element: The element to get the role for

Returns:

  • The ARIA role as a string
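
The explicit-then-implicit lookup can be sketched like this. The function `ariaRole`, its attribute map parameter, and the deliberately small `implicitRoles` table are assumptions for illustration; the package's own mapping is likely larger and may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// implicitRoles maps a small subset of tag names to their implicit ARIA roles.
var implicitRoles = map[string]string{
	"nav":     "navigation",
	"main":    "main",
	"article": "article",
	"aside":   "complementary",
	"button":  "button",
	"ul":      "list",
	"ol":      "list",
	"li":      "listitem",
	"table":   "table",
	"h1":      "heading",
	"h2":      "heading",
	"h3":      "heading",
	"h4":      "heading",
	"h5":      "heading",
	"h6":      "heading",
}

// ariaRole is a hypothetical helper: an explicit role attribute wins,
// otherwise the role implied by the tag name is used ("" if none is known).
func ariaRole(tagName string, attrs map[string]string) string {
	if role := strings.TrimSpace(attrs["role"]); role != "" {
		return role
	}
	return implicitRoles[strings.ToLower(tagName)]
}

func main() {
	fmt.Println(ariaRole("nav", nil))
	fmt.Println(ariaRole("div", map[string]string{"role": "search"}))
}
```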

func GetArticleByline

func GetArticleByline(doc *dom.VDocument) string

GetArticleByline extracts the author information from the document. It uses various strategies including meta tags and JSON-LD data to find the author or byline information associated with the content.

Parameters:

  • doc: The parsed HTML document

Returns:

  • The extracted author/byline information as a string

func GetArticleTitle

func GetArticleTitle(doc *dom.VDocument) string

GetArticleTitle extracts the article title from the document. It tries various strategies to find the most appropriate title, including examining the <title> element, heading elements, and handling common title patterns like site name separators.

Parameters:

  • doc: The parsed HTML document

Returns:

  • The extracted article title as a string

func GetAttribute

func GetAttribute(element *dom.VElement, name string) string

GetAttribute gets the value of an attribute on an element. Returns an empty string if the attribute doesn't exist.

Parameters:

  • element: The element to get the attribute from
  • name: The name of the attribute to get

Returns:

  • The attribute value, or an empty string if not found

func GetClassWeight

func GetClassWeight(node *dom.VElement) float64

GetClassWeight calculates a score adjustment based on the class name and ID of an element. It returns a positive score for elements likely to contain content and a negative score for elements likely to be noise. This helps the algorithm prioritize content-rich elements and deprioritize elements that typically contain non-content material.

Parameters:

  • node: The element to calculate a class weight for

Returns:

  • A float64 score adjustment (positive for likely content, negative for likely noise)
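
One way to picture this adjustment is the pattern used in Mozilla's Readability.js, where the class name and the ID are each checked against a "positive" and a "negative" regular expression and the score moves by a fixed step. The sketch below follows that idea; the shortened patterns and the step of 25 are assumptions here and are not confirmed to match this package's values.

```go
package main

import (
	"fmt"
	"regexp"
)

// Shortened, illustrative hint lists (assumptions, not the package's patterns).
var (
	positiveHints = regexp.MustCompile(`(?i)article|body|content|entry|main|post|text|blog|story`)
	negativeHints = regexp.MustCompile(`(?i)banner|comment|footer|footnote|masthead|promo|related|sidebar|sponsor|widget`)
)

// classWeight is a hypothetical helper that scores a class name and an ID.
// Each non-empty value adds 25 for a positive hint and subtracts 25 for a
// negative hint.
func classWeight(className, id string) float64 {
	weight := 0.0
	for _, s := range []string{className, id} {
		if s == "" {
			continue
		}
		if negativeHints.MatchString(s) {
			weight -= 25
		}
		if positiveHints.MatchString(s) {
			weight += 25
		}
	}
	return weight
}

func main() {
	fmt.Println(classWeight("article-body", ""))
	fmt.Println(classWeight("sidebar", "footer"))
}
```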

func GetElementsByTagName

func GetElementsByTagName(element *dom.VElement, tagName string) []*dom.VElement

GetElementsByTagName returns all elements with the specified tag name in the element tree. If tagName is "*", it returns all elements.

Parameters:

  • element: The root element to search from
  • tagName: The tag name to search for, or "*" for all elements

Returns:

  • A slice of elements matching the tag name

func GetElementsByTagNames

func GetElementsByTagNames(element *dom.VElement, tagNames []string) []*dom.VElement

GetElementsByTagNames returns all elements with any of the specified tag names in the element tree. This is useful for finding elements of multiple types in a single pass.

Parameters:

  • element: The root element to search from
  • tagNames: A slice of tag names to search for

Returns:

  • A slice of elements matching any of the tag names

func GetInnerText

func GetInnerText(node dom.VNode, normalizeSpaces bool) string

GetInnerText returns the inner text of an element or text node. If normalizeSpaces is true, consecutive whitespace is normalized to a single space. This extracts all text content from an element and its descendants.

Parameters:

  • node: The node to get text from
  • normalizeSpaces: Whether to normalize whitespace

Returns:

  • The combined text content of the node and its descendants
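
The whitespace normalization step on its own can be sketched with a single regular expression. The helper name `normalizeSpaces` is hypothetical, and trimming the leading and trailing space is an assumption of this sketch.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var whitespaceRun = regexp.MustCompile(`\s+`)

// normalizeSpaces is a hypothetical helper: every run of whitespace
// (spaces, tabs, line breaks) becomes one space, and the ends are trimmed.
func normalizeSpaces(s string) string {
	return strings.TrimSpace(whitespaceRun.ReplaceAllString(s, " "))
}

func main() {
	fmt.Printf("%q\n", normalizeSpaces("  Hello \n\t world  "))
}
```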

func GetLinkDensity

func GetLinkDensity(element *dom.VElement) float64

GetLinkDensity calculates the ratio of link text to all text in an element. Returns a value between 0 and 1, where higher values indicate more links. This is useful for identifying navigation areas and other link-heavy sections that are unlikely to be main content.

Parameters:

  • element: The element to calculate link density for

Returns:

  • A float64 between 0 and 1 representing the link density
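
The ratio itself is simple to sketch: the length of text inside anchor elements divided by the length of all text. The `node` type below is a hypothetical miniature tree, not dom.VElement, and counting length in runes and returning 0 for an element without text are assumptions.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// node is a hypothetical miniature DOM node. Text nodes have an empty Tag.
type node struct {
	Tag      string
	Text     string
	Children []*node
}

// textLen returns the number of runes of text under n.
func textLen(n *node) int {
	if n.Tag == "" {
		return utf8.RuneCountInString(n.Text)
	}
	total := 0
	for _, c := range n.Children {
		total += textLen(c)
	}
	return total
}

// linkTextLen returns the number of runes of text that sit inside <a> elements.
func linkTextLen(n *node) int {
	if n.Tag == "a" {
		return textLen(n)
	}
	total := 0
	for _, c := range n.Children {
		total += linkTextLen(c)
	}
	return total
}

// linkDensity is link text length divided by total text length (0 if no text).
func linkDensity(n *node) float64 {
	all := textLen(n)
	if all == 0 {
		return 0
	}
	return float64(linkTextLen(n)) / float64(all)
}

func main() {
	p := &node{Tag: "p", Children: []*node{
		{Text: "Read the "},
		{Tag: "a", Children: []*node{{Text: "full story"}}},
	}}
	fmt.Printf("%.2f\n", linkDensity(p))
}
```

A navigation list made entirely of links would score 1 under this sketch, while a paragraph with no links would score 0.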

func GetNodeAncestors

func GetNodeAncestors(node *dom.VElement, maxDepth int) []*dom.VElement

GetNodeAncestors returns the ancestor elements of a node up to a specified depth. If maxDepth is less than or equal to 0, all ancestors are returned. This is useful for traversing up the DOM tree to find parent elements.

Parameters:

  • node: The element to get ancestors for
  • maxDepth: The maximum number of ancestors to return, or <= 0 for all

Returns:

  • A slice of ancestor elements, ordered from closest to furthest
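
The maxDepth convention described above (a value of 0 or less means "no limit") can be sketched with a parent-pointer walk. The `element` type is a hypothetical stand-in for dom.VElement, and returning nil for a nil input is an assumption.

```go
package main

import "fmt"

// element is a hypothetical stand-in with only a tag name and a parent link.
type element struct {
	Tag    string
	Parent *element
}

// ancestors walks up the parent chain, closest ancestor first.
// A maxDepth of 0 or less returns every ancestor.
func ancestors(e *element, maxDepth int) []*element {
	if e == nil {
		return nil
	}
	var out []*element
	for p := e.Parent; p != nil; p = p.Parent {
		out = append(out, p)
		if maxDepth > 0 && len(out) == maxDepth {
			break
		}
	}
	return out
}

func main() {
	html := &element{Tag: "html"}
	body := &element{Tag: "body", Parent: html}
	div := &element{Tag: "div", Parent: body}
	p := &element{Tag: "p", Parent: div}

	for _, a := range ancestors(p, 2) {
		fmt.Println(a.Tag)
	}
}
```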

func GetTextDensity

func GetTextDensity(element *dom.VElement) float64

GetTextDensity calculates the ratio of text to child elements in an element. Returns a value where higher values indicate more text-dense content. This helps identify content-rich elements that are likely to be the main content.

Parameters:

  • element: The element to calculate text density for

Returns:

  • A float64 representing the text density

func HasAncestorTag

func HasAncestorTag(node dom.VNode, tagName string, maxDepth int) bool

HasAncestorTag checks if a node has an ancestor with the specified tag name. If maxDepth is less than or equal to 0, all ancestors are checked. This is useful for determining if an element is contained within a specific type of element.

Parameters:

  • node: The node to check ancestors for
  • tagName: The tag name to look for in ancestors
  • maxDepth: The maximum depth to check, or <= 0 for unlimited

Returns:

  • true if an ancestor with the specified tag name is found, false otherwise

func InitializeNode

func InitializeNode(node *dom.VElement)

InitializeNode initializes a node with a readability score. It sets an initial score based on the tag name and adjusts it based on class name and ID. This is a key part of the content scoring algorithm, establishing baseline scores for different HTML elements.

Parameters:

  • node: The element to initialize with a readability score

func IsProbablyContent

func IsProbablyContent(element *dom.VElement) bool

IsProbablyContent reports whether an element is likely to contain meaningful content. It is a simplified check similar to isProbablyReaderable, and considers visibility, class/ID patterns, text length, and link density.

Parameters:

  • element: The element to evaluate

Returns:

  • true if the element is likely to contain meaningful content, false otherwise

func IsProbablyVisible

func IsProbablyVisible(node *dom.VElement) bool

IsProbablyVisible checks if an element is likely to be visible based on its attributes. This helps filter out hidden elements that shouldn't be included in the extracted content.

Parameters:

  • node: The element to check

Returns:

  • true if the element is likely visible, false otherwise

func IsSemanticTag

func IsSemanticTag(element *dom.VElement) bool

IsSemanticTag checks if an element is a semantic tag or contains semantic tags. Semantic tags include main, article, and elements with content-related classes/IDs. These tags provide structural meaning to the content and are strong indicators of meaningful content areas.

Parameters:

  • element: The element to check

Returns:

  • true if the element is or contains semantic tags, false otherwise

func IsSignificantNode

func IsSignificantNode(node *dom.VElement) bool

IsSignificantNode determines if a node is semantically significant. This includes elements like header, footer, main, article, etc. Significant nodes are important structural elements that help understand the page's organization even when the main content extraction fails.

Parameters:

  • node: The element to check

Returns:

  • true if the node is semantically significant, false otherwise

func IsURL

func IsURL(str string) bool

IsURL checks if a string is a valid URL. This is a simple validation function that checks if a string starts with http:// or https:// to determine if it's likely a URL.

Parameters:

  • str: The string to check

Returns:

  • true if the string appears to be a URL, false otherwise
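
A prefix check of the kind described can be sketched in a few lines. The helper name `looksLikeURL` is hypothetical, and case-sensitive matching is an assumption. Note that such a check accepts any string with the right prefix, so it does not validate the rest of the URL.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeURL is a hypothetical helper: it only checks the scheme prefix.
func looksLikeURL(s string) bool {
	return strings.HasPrefix(s, "http://") || strings.HasPrefix(s, "https://")
}

func main() {
	for _, s := range []string{"https://example.com/article", "example.com", "ftp://example.com"} {
		fmt.Println(s, looksLikeURL(s))
	}
}
```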

func ParseHTML

func ParseHTML(htmlContent string, baseURI string) (*dom.VDocument, error)

ParseHTML parses an HTML string and returns a virtual DOM document. It uses golang.org/x/net/html for parsing and converts the result to our internal DOM structure. The baseURI parameter is used to resolve relative URLs in the document.

Parameters:

  • htmlContent: The HTML string to parse
  • baseURI: The base URI for resolving relative URLs (can be empty)

Returns:

  • A pointer to a VDocument representing the parsed HTML
  • An error if parsing fails

func PreprocessDocument

func PreprocessDocument(doc *dom.VDocument) *dom.VDocument

PreprocessDocument removes noise elements from the document. This includes removing semantic tags, unnecessary tags, and ad elements. Preprocessing is an important step to clean up the document before content extraction.

Parameters:

  • doc: The parsed HTML document to preprocess

Returns:

  • The same document after preprocessing (for method chaining)

func SerializeDocumentToHTML

func SerializeDocumentToHTML(doc *dom.VDocument) string

SerializeDocumentToHTML converts a virtual DOM document to an HTML string. This serializes an entire document, including the doctype and HTML structure.

Parameters:

  • doc: The VDocument to serialize

Returns:

  • An HTML string representation of the document

func SerializeDocumentToWriter

func SerializeDocumentToWriter(doc *dom.VDocument, w io.Writer) error

SerializeDocumentToWriter writes the HTML representation of a document to a writer. This serializes an entire document to a writer, which is useful for streaming HTML output to a file or response writer.

Parameters:

  • doc: The VDocument to serialize
  • w: The io.Writer to write to

Returns:

  • An error if writing fails

func SerializeToHTML

func SerializeToHTML(node dom.VNode) string

SerializeToHTML converts a virtual DOM element to an HTML string. This is useful for converting a VNode back to an HTML string after processing.

Parameters:

  • node: The VNode to serialize

Returns:

  • An HTML string representation of the node

func SerializeToWriter

func SerializeToWriter(node dom.VNode, w io.Writer) error

SerializeToWriter writes the HTML representation of a node to a writer. This is useful for streaming HTML output to a file or response writer.

Parameters:

  • node: The VNode to serialize
  • w: The io.Writer to write to

Returns:

  • An error if writing fails

func Stringify

func Stringify(element *dom.VElement) string

Stringify converts a VElement to a readable plain-text string. It removes tags, inserting line breaks according to whether elements are block or inline, aligns all text to the shallowest indent, and merges consecutive line breaks into one.

Parameters:

  • element: The element to convert to a string

Returns:

  • A plain text representation of the element's content

func TextSimilarity

func TextSimilarity(textA, textB string) float64

TextSimilarity compares two texts and returns a similarity score between 0 and 1. 1 means identical texts, 0 means completely different texts. This is used to compare potential titles and other text elements to find the best match.

Parameters:

  • textA: The first text to compare
  • textB: The second text to compare

Returns:

  • A float64 similarity score between 0 and 1
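
One token-based way to produce such a score is the approach used in Mozilla's Readability.js: lowercase both texts, split them into word tokens, and measure how much of the second text consists of tokens that never appear in the first. The sketch below follows that idea; whether this package uses the same formula is an assumption. The measure is asymmetric, and Go's `\W` is ASCII-only, so non-ASCII letters act as separators here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var nonWord = regexp.MustCompile(`\W+`)

// tokenize lowercases s and splits it into non-empty word tokens.
func tokenize(s string) []string {
	var out []string
	for _, t := range nonWord.Split(strings.ToLower(s), -1) {
		if t != "" {
			out = append(out, t)
		}
	}
	return out
}

// similarity is a hypothetical helper returning a score in [0, 1]:
// 1 minus the share of b (by joined length) made of tokens absent from a.
func similarity(a, b string) float64 {
	tokensA, tokensB := tokenize(a), tokenize(b)
	if len(tokensA) == 0 || len(tokensB) == 0 {
		return 0
	}
	inA := make(map[string]bool, len(tokensA))
	for _, t := range tokensA {
		inA[t] = true
	}
	var onlyB []string
	for _, t := range tokensB {
		if !inA[t] {
			onlyB = append(onlyB, t)
		}
	}
	distance := float64(len(strings.Join(onlyB, " "))) / float64(len(strings.Join(tokensB, " ")))
	return 1 - distance
}

func main() {
	fmt.Println(similarity("Go Readability", "Go Readability Guide"))
	fmt.Println(similarity("Hello World", "hello, world!"))
}
```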

func ToHTML

func ToHTML(element *dom.VElement) string

ToHTML generates an HTML string from a VElement, omitting span tags and class attributes. This produces a cleaner HTML representation of the extracted content by removing unnecessary styling and presentation elements.

Parameters:

  • element: The element to convert to HTML

Returns:

  • A string containing the HTML representation of the element

func ToMarkdown

func ToMarkdown(element *dom.VElement) string

ToMarkdown converts a VElement to a Markdown string. This is the main entry point for HTML to Markdown conversion, which produces a well-formatted Markdown document from an HTML element.

Parameters:

  • element: The HTML element to convert to Markdown

Returns:

  • A Markdown string representation of the element

func UnescapeHTMLEntities

func UnescapeHTMLEntities(str string) string

UnescapeHTMLEntities converts HTML entities to their corresponding characters. This handles both named entities like &amp; and numeric entities like &#39;.

Parameters:

  • str: The string containing HTML entities to unescape

Returns:

  • The unescaped string with entities converted to their character equivalents
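
For comparison, the Go standard library offers the same kind of conversion through `html.UnescapeString`. The short sketch below uses it as a stand-in; the wrapper name `unescape` is hypothetical, and it is an assumption that the package's function handles the same set of entities.

```go
package main

import (
	"fmt"
	"html"
)

// unescape is a hypothetical wrapper around the standard library's
// html.UnescapeString, which handles named and numeric entities.
func unescape(s string) string {
	return html.UnescapeString(s)
}

func main() {
	fmt.Println(unescape("Tom &amp; Jerry&#39;s &lt;b&gt;bold&lt;/b&gt; plan"))
}
```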

Types

type AriaNode

type AriaNode struct {
	Type            AriaNodeType  // Type of the ARIA node
	Name            string        // Accessible name
	Role            string        // Explicit ARIA role
	Level           int           // Heading level, etc.
	Checked         *bool         // Checkbox state (pointer to allow nil for "not applicable")
	Selected        *bool         // Selection state
	Expanded        *bool         // Expansion state
	Disabled        *bool         // Disabled state
	Required        *bool         // Required state
	ValueMin        *float64      // Minimum value
	ValueMax        *float64      // Maximum value
	ValueText       string        // Text representation of value
	Children        []*AriaNode   // Child nodes
	OriginalElement *dom.VElement // Reference to the original DOM element
}

AriaNode represents a node in an accessibility tree. It contains information about the accessibility properties of an element, such as its role, name, state, and children, which is useful for understanding the semantic structure of a document from an accessibility perspective.

func BuildAriaNode

func BuildAriaNode(element *dom.VElement) *AriaNode

BuildAriaNode builds an AriaNode from a DOM element. This recursively constructs an accessibility tree node from a DOM element, including its properties and children.

Parameters:

  • element: The DOM element to build an AriaNode from

Returns:

  • An AriaNode representing the element and its children

func CompressAriaTree

func CompressAriaTree(node *AriaNode) *AriaNode

CompressAriaTree compresses an AriaTree by removing insignificant nodes, merging similar nodes, and simplifying the structure. This produces a more concise and meaningful representation of the document's accessibility structure.

Parameters:

  • node: The root node of the tree to compress

Returns:

  • The compressed tree's root node

type AriaNodeType

type AriaNodeType string

AriaNodeType represents the type of an ARIA node.

const (
	// ARIA landmark roles
	AriaNodeTypeBanner        AriaNodeType = "banner"
	AriaNodeTypeComplementary AriaNodeType = "complementary"
	AriaNodeTypeContentInfo   AriaNodeType = "contentinfo"
	AriaNodeTypeForm          AriaNodeType = "form"
	AriaNodeTypeMain          AriaNodeType = "main"
	AriaNodeTypeNavigation    AriaNodeType = "navigation"
	AriaNodeTypeRegion        AriaNodeType = "region"
	AriaNodeTypeSearch        AriaNodeType = "search"

	// ARIA widget roles
	AriaNodeTypeArticle      AriaNodeType = "article"
	AriaNodeTypeButton       AriaNodeType = "button"
	AriaNodeTypeCell         AriaNodeType = "cell"
	AriaNodeTypeCheckbox     AriaNodeType = "checkbox"
	AriaNodeTypeColumnHeader AriaNodeType = "columnheader"
	AriaNodeTypeCombobox     AriaNodeType = "combobox"
	AriaNodeTypeDialog       AriaNodeType = "dialog"
	AriaNodeTypeFigure       AriaNodeType = "figure"
	AriaNodeTypeGrid         AriaNodeType = "grid"
	AriaNodeTypeGridCell     AriaNodeType = "gridcell"
	AriaNodeTypeHeading      AriaNodeType = "heading"
	AriaNodeTypeImg          AriaNodeType = "img"
	AriaNodeTypeLink         AriaNodeType = "link"
	AriaNodeTypeList         AriaNodeType = "list"
	AriaNodeTypeListItem     AriaNodeType = "listitem"
	AriaNodeTypeMenuItem     AriaNodeType = "menuitem"
	AriaNodeTypeOption       AriaNodeType = "option"
	AriaNodeTypeProgressBar  AriaNodeType = "progressbar"
	AriaNodeTypeRadio        AriaNodeType = "radio"
	AriaNodeTypeRadioGroup   AriaNodeType = "radiogroup"
	AriaNodeTypeRow          AriaNodeType = "row"
	AriaNodeTypeRowGroup     AriaNodeType = "rowgroup"
	AriaNodeTypeRowHeader    AriaNodeType = "rowheader"
	AriaNodeTypeSearchBox    AriaNodeType = "searchbox"
	AriaNodeTypeSeparator    AriaNodeType = "separator"
	AriaNodeTypeSlider       AriaNodeType = "slider"
	AriaNodeTypeSpinButton   AriaNodeType = "spinbutton"
	AriaNodeTypeSwitch       AriaNodeType = "switch"
	AriaNodeTypeTab          AriaNodeType = "tab"
	AriaNodeTypeTable        AriaNodeType = "table"
	AriaNodeTypeTabList      AriaNodeType = "tablist"
	AriaNodeTypeTabPanel     AriaNodeType = "tabpanel"
	AriaNodeTypeTextBox      AriaNodeType = "textbox"
	AriaNodeTypeText         AriaNodeType = "text"
	AriaNodeTypeGeneric      AriaNodeType = "generic" // Any other role
)

ARIA node types

func GetAriaNodeType

func GetAriaNodeType(element *dom.VElement) AriaNodeType

GetAriaNodeType determines the AriaNodeType of an element based on its role. This maps ARIA roles to their corresponding AriaNodeType enum values.

Parameters:

  • element: The element to determine the node type for

Returns:

  • The AriaNodeType corresponding to the element's role

type AriaTree

type AriaTree struct {
	Root      *AriaNode // Root node of the ARIA tree
	NodeCount int       // Total number of nodes in the tree
}

AriaTree represents an accessibility tree. This is a hierarchical representation of a document's accessibility structure, which can be used as a fallback when traditional content extraction fails.

func BuildAriaTree

func BuildAriaTree(doc *dom.VDocument) *AriaTree

BuildAriaTree builds an AriaTree from a DOM document. This constructs a complete accessibility tree from a document, then compresses it to produce a more concise and meaningful representation.

Parameters:

  • doc: The DOM document to build an AriaTree from

Returns:

  • An AriaTree representing the document's accessibility structure

type ArticleContent

type ArticleContent struct {
	Title  string        // Extracted title
	Byline string        // Extracted byline/author
	Root   *dom.VElement // Main content root element
}

ArticleContent represents the content of an article page. This is a simplified view of ReadabilityArticle focused on article-specific content.

type OtherContent

type OtherContent struct {
	Title                 string          // Extracted title
	Header                *dom.VElement   // Page header, if identified
	Footer                *dom.VElement   // Page footer, if identified
	OtherSignificantNodes []*dom.VElement // Other semantically significant nodes
	AriaTree              *AriaTree       // ARIA tree representation
}

OtherContent represents the content of a non-article page. This is used for pages that don't fit the article pattern, such as index pages, landing pages, or other non-article content.

type PageType

type PageType string

PageType represents the type of a page (article, other, etc.). This is used to classify pages based on their content structure and characteristics.

const (
	// PageTypeArticle represents a standard article page
	PageTypeArticle PageType = "article"
	// PageTypeOther represents any page that is not a standard article (e.g., index, list, error)
	PageTypeOther PageType = "other"
)

func ClassifyPageType

func ClassifyPageType(
	doc *dom.VDocument,
	candidates []*dom.VElement,
	charThreshold int,
	url string,
) PageType

ClassifyPageType classifies a document as an article or other type of page. It uses various heuristics including URL pattern, semantic tags, text length, link density, and more to determine the page type. This classification helps the extraction process decide how to handle different types of content.

Parameters:

  • doc: The parsed HTML document
  • candidates: The list of content candidates found by the scoring algorithm
  • charThreshold: The minimum character threshold for article content
  • url: The URL of the page (optional, used for URL pattern analysis)

Returns:

  • PageType: Either PageTypeArticle or PageTypeOther
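Of the heuristics listed above, link density is the easiest to illustrate in isolation. The following is a minimal sketch of the general idea only (the share of an element's text that sits inside links), not the package's implementation. The node type is a hypothetical stand-in for dom.VElement, and the helper names are assumptions made for this example.

```go
package main

import "fmt"

// node is a hypothetical stand-in for the library's dom.VElement.
type node struct {
	tag      string
	text     string
	children []*node
}

// textLen returns the number of characters of text under n.
func textLen(n *node) int {
	total := len([]rune(n.text))
	for _, c := range n.children {
		total += textLen(c)
	}
	return total
}

// linkTextLen returns the number of characters of text inside <a> elements.
func linkTextLen(n *node) int {
	if n.tag == "a" {
		return textLen(n)
	}
	total := 0
	for _, c := range n.children {
		total += linkTextLen(c)
	}
	return total
}

// linkDensity is the share of text that sits inside links (0 for empty nodes).
func linkDensity(n *node) float64 {
	all := textLen(n)
	if all == 0 {
		return 0
	}
	return float64(linkTextLen(n)) / float64(all)
}

func main() {
	// A navigation list: all of its text is link text.
	nav := &node{tag: "ul", children: []*node{
		{tag: "a", text: "Home"},
		{tag: "a", text: "About"},
	}}
	// A paragraph-like block: mostly prose with one short link.
	body := &node{tag: "div", text: "Forty characters of ordinary prose text.", children: []*node{
		{tag: "a", text: "a source"},
	}}
	fmt.Printf("nav=%.2f body=%.2f\n", linkDensity(nav), linkDensity(body))
}
```

A block whose text is almost entirely links (such as the navigation list) is a weak article candidate, while prose with the occasional link scores low. How the package weighs this signal against the others is not documented here.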

func GetExpectedPageTypeByUrl

func GetExpectedPageTypeByUrl(url string) PageType

GetExpectedPageTypeByUrl determines the expected page type based on URL patterns. This is a helper function that can be used before full page analysis to get a preliminary classification based solely on URL patterns.

Parameters:

  • url: The URL of the page to analyze

Returns:

  • PageType: Either PageTypeArticle or PageTypeOther based on URL patterns

type ReadabilityArticle

type ReadabilityArticle struct {
	Title     string        // Extracted title
	Byline    string        // Extracted byline/author information
	Root      *dom.VElement // Main content root element (if score threshold is met)
	NodeCount int           // Total number of nodes
	PageType  PageType      // Classification of page type

	// Structural elements (set when PageType is ARTICLE but Root is nil)
	Header                *dom.VElement   // Page header element, if identified
	Footer                *dom.VElement   // Page footer element, if identified
	OtherSignificantNodes []*dom.VElement // Other semantically significant nodes

	// Fallback when article extraction fails
	AriaTree *AriaTree // ARIA tree representation
}

ReadabilityArticle represents the result of a readability extraction. It contains the extracted content, metadata, and structural information about the page.

func Extract

func Extract(html string, options ReadabilityOptions) (ReadabilityArticle, error)

Extract extracts the article content from HTML. This is the main entry point for the readability extraction process. It parses the HTML, preprocesses the document, and extracts the main content based on the provided options.

Parameters:

  • html: The HTML string to extract content from
  • options: Configuration options for the extraction process

Returns:

  • A ReadabilityArticle containing the extracted content and metadata
  • An error if the HTML parsing fails

func ExtractContent

func ExtractContent(doc *dom.VDocument, options ReadabilityOptions) ReadabilityArticle

ExtractContent extracts the main content from a document. This is the core function for content extraction that implements the main readability algorithm to identify and extract the primary content.

Parameters:

  • doc: The parsed HTML document as a VDocument
  • options: Configuration options for the extraction process

Returns:

  • A ReadabilityArticle containing the extracted content and metadata

func (*ReadabilityArticle) GetContentByPageType

func (r *ReadabilityArticle) GetContentByPageType() interface{}

GetContentByPageType returns the content structure that matches the page type: ArticleContent for article pages and OtherContent for everything else. This allows consumers to handle different page types with type-specific structures.

Returns:

  • ArticleContent if the page is classified as an article
  • OtherContent if the page is classified as any other type
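Because the method returns interface{}, callers usually need a type switch to get at the concrete struct. The sketch below shows that pattern with local stand-in types that mirror the documented ArticleContent and OtherContent (trimmed to their string fields). The describe helper is hypothetical, and since the documentation above does not say whether values or pointers are returned, the switch covers both.

```go
package main

import "fmt"

// Local stand-ins that mirror the documented structs. The real types live in
// the readability package and also carry *dom.VElement fields.
type ArticleContent struct {
	Title  string
	Byline string
}

type OtherContent struct {
	Title string
}

// describe (hypothetical helper) handles the interface{} value with a type
// switch. Value and pointer cases are both listed because the returned form
// is an assumption of this sketch.
func describe(content interface{}) string {
	switch c := content.(type) {
	case ArticleContent:
		return fmt.Sprintf("article: %q by %q", c.Title, c.Byline)
	case *ArticleContent:
		return fmt.Sprintf("article: %q by %q", c.Title, c.Byline)
	case OtherContent:
		return fmt.Sprintf("other page: %q", c.Title)
	case *OtherContent:
		return fmt.Sprintf("other page: %q", c.Title)
	default:
		return "unknown content type"
	}
}

func main() {
	fmt.Println(describe(ArticleContent{Title: "Hello", Byline: "Ann"}))
	fmt.Println(describe(&OtherContent{Title: "Index"}))
}
```

With the real package, the value passed to describe would come from calling Extract and then GetContentByPageType on the resulting ReadabilityArticle.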

type ReadabilityMetadata

type ReadabilityMetadata struct {
	Title         string
	Byline        string
	Excerpt       string
	SiteName      string
	PublishedTime string
}

ReadabilityMetadata represents metadata extracted from a document. It contains information like title, author, excerpt, site name, and publication date that helps identify and contextualize the content.

func GetJSONLD

func GetJSONLD(doc *dom.VDocument) ReadabilityMetadata

GetJSONLD extracts metadata from JSON-LD objects in the document. It currently only supports Schema.org objects of type Article or its subtypes. JSON-LD is a structured data format that provides rich metadata about web content.

Parameters:

  • doc: The parsed HTML document

Returns:

  • ReadabilityMetadata containing information extracted from JSON-LD
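To make the mapping concrete, here is a minimal standard-library sketch of reading a Schema.org Article object and filling the five metadata fields. It is not the package's implementation. The property-to-field mapping (headline, author.name, description, publisher.name, datePublished), the short list of accepted types, and the assumption that @type is a string and author is a single object are all simplifications chosen for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metadata mirrors the documented ReadabilityMetadata fields.
type metadata struct {
	Title, Byline, Excerpt, SiteName, PublishedTime string
}

// jsonLDArticle lists only the Schema.org properties this sketch reads.
type jsonLDArticle struct {
	Type          string `json:"@type"`
	Headline      string `json:"headline"`
	Description   string `json:"description"`
	DatePublished string `json:"datePublished"`
	Author        struct {
		Name string `json:"name"`
	} `json:"author"`
	Publisher struct {
		Name string `json:"name"`
	} `json:"publisher"`
}

// articleTypes is an illustrative subset of Article and its subtypes.
var articleTypes = map[string]bool{
	"Article":     true,
	"NewsArticle": true,
	"BlogPosting": true,
}

// parseJSONLD (hypothetical helper) decodes one JSON-LD object and maps it
// onto the metadata struct, rejecting types outside articleTypes.
func parseJSONLD(raw string) (metadata, error) {
	var a jsonLDArticle
	if err := json.Unmarshal([]byte(raw), &a); err != nil {
		return metadata{}, err
	}
	if !articleTypes[a.Type] {
		return metadata{}, fmt.Errorf("unsupported @type %q", a.Type)
	}
	return metadata{
		Title:         a.Headline,
		Byline:        a.Author.Name,
		Excerpt:       a.Description,
		SiteName:      a.Publisher.Name,
		PublishedTime: a.DatePublished,
	}, nil
}

func main() {
	raw := `{
		"@context": "https://schema.org",
		"@type": "NewsArticle",
		"headline": "Example headline",
		"description": "A short summary.",
		"datePublished": "2025-04-13",
		"author": {"@type": "Person", "name": "Ann Author"},
		"publisher": {"@type": "Organization", "name": "Example Site"}
	}`
	m, err := parseJSONLD(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", m)
}
```

Real pages often use arrays for @type or author, or wrap several objects in @graph; a production parser has to handle those shapes, which this sketch deliberately leaves out.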

type ReadabilityOptions

type ReadabilityOptions struct {
	// CharThreshold is the minimum number of characters an article must have
	CharThreshold int
	// NbTopCandidates is the number of top candidates to consider
	NbTopCandidates int
	// GenerateAriaTree indicates whether to generate ARIA tree representation
	GenerateAriaTree bool
	// ForcedPageType allows forcing a specific page type classification
	ForcedPageType PageType
}

ReadabilityOptions contains configuration options for the readability extraction process. These options control various aspects of the content extraction algorithm, such as thresholds, candidate selection, and output format.

func DefaultOptions

func DefaultOptions() ReadabilityOptions

DefaultOptions returns a ReadabilityOptions struct with default values. This provides a convenient way to get a pre-configured options object with reasonable defaults for most extraction scenarios.

Returns:

  • A ReadabilityOptions struct initialized with default values
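A common pattern is to start from the defaults and override only the fields that matter for a given page. The sketch below shows that pattern with local stand-ins that mirror the documented types. The defaultOptions function is hypothetical, and its values (500 and 5) are placeholders borrowed from Mozilla's Readability.js defaults; the actual values returned by DefaultOptions are not stated here.

```go
package main

import "fmt"

// PageType and ReadabilityOptions mirror the documented declarations.
type PageType string

const (
	PageTypeArticle PageType = "article"
	PageTypeOther   PageType = "other"
)

type ReadabilityOptions struct {
	CharThreshold    int
	NbTopCandidates  int
	GenerateAriaTree bool
	ForcedPageType   PageType
}

// defaultOptions is a hypothetical stand-in for DefaultOptions. The values
// are placeholders, not the package's documented defaults.
func defaultOptions() ReadabilityOptions {
	return ReadabilityOptions{CharThreshold: 500, NbTopCandidates: 5}
}

// shortArticleOptions (hypothetical helper) lowers the character threshold
// and forces article classification, leaving the other fields at defaults.
func shortArticleOptions() ReadabilityOptions {
	opts := defaultOptions() // returned by value, so edits stay local
	opts.CharThreshold = 200
	opts.ForcedPageType = PageTypeArticle
	return opts
}

func main() {
	fmt.Printf("%+v\n", shortArticleOptions())
}
```

Because the options struct is returned by value, each caller gets its own copy and can adjust it without affecting other extractions.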

Directories

Path Synopsis
cmd
internal
dom
Package dom provides virtual DOM structures and operations for HTML parsing and manipulation.
parser
Package parser provides HTML parsing functionality for the readability library.
