Documentation ¶
Overview ¶
Package readability provides functionality to extract readable content from HTML documents. It implements an algorithm similar to Mozilla's Readability.js to identify and extract the main content from web pages, removing clutter, navigation, ads, and other non-content elements.
Index ¶
- func AddSignificantElementsByClassOrId(body *dom.VElement, potentialNodes *[]*dom.VElement)
- func AnalyzeUrlPattern(url string) string
- func AriaTreeToString(tree *AriaTree) string
- func CountAriaNodes(node *AriaNode) int
- func CountNodes(element *dom.VElement) int
- func CreateElement(tagName string) *dom.VElement
- func CreateExtractor(options ReadabilityOptions) func(string) (ReadabilityArticle, error)
- func CreateTextNode(content string) *dom.VText
- func ExtractTextContent(element *dom.VElement) string
- func FindMainCandidates(doc *dom.VDocument, nbTopCandidates int) []*dom.VElement
- func FindStructuralElements(doc *dom.VDocument) (header *dom.VElement, footer *dom.VElement, ...)
- func FormatDocument(text string) string
- func GetAccessibleName(element *dom.VElement) string
- func GetAriaRole(element *dom.VElement) string
- func GetArticleByline(doc *dom.VDocument) string
- func GetArticleTitle(doc *dom.VDocument) string
- func GetAttribute(element *dom.VElement, name string) string
- func GetClassWeight(node *dom.VElement) float64
- func GetElementsByTagName(element *dom.VElement, tagName string) []*dom.VElement
- func GetElementsByTagNames(element *dom.VElement, tagNames []string) []*dom.VElement
- func GetInnerText(node dom.VNode, normalizeSpaces bool) string
- func GetLinkDensity(element *dom.VElement) float64
- func GetNodeAncestors(node *dom.VElement, maxDepth int) []*dom.VElement
- func GetTextDensity(element *dom.VElement) float64
- func HasAncestorTag(node dom.VNode, tagName string, maxDepth int) bool
- func InitializeNode(node *dom.VElement)
- func IsProbablyContent(element *dom.VElement) bool
- func IsProbablyVisible(node *dom.VElement) bool
- func IsSemanticTag(element *dom.VElement) bool
- func IsSignificantNode(node *dom.VElement) bool
- func IsURL(str string) bool
- func ParseHTML(htmlContent string, baseURI string) (*dom.VDocument, error)
- func PreprocessDocument(doc *dom.VDocument) *dom.VDocument
- func SerializeDocumentToHTML(doc *dom.VDocument) string
- func SerializeDocumentToWriter(doc *dom.VDocument, w io.Writer) error
- func SerializeToHTML(node dom.VNode) string
- func SerializeToWriter(node dom.VNode, w io.Writer) error
- func Stringify(element *dom.VElement) string
- func TextSimilarity(textA, textB string) float64
- func ToHTML(element *dom.VElement) string
- func ToMarkdown(element *dom.VElement) string
- func UnescapeHTMLEntities(str string) string
- type AriaNode
- type AriaNodeType
- type AriaTree
- type ArticleContent
- type OtherContent
- type PageType
- type ReadabilityArticle
- type ReadabilityMetadata
- type ReadabilityOptions
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func AddSignificantElementsByClassOrId ¶
AddSignificantElementsByClassOrId detects elements with meaningful class names or IDs and adds them to the potentialNodes slice. This helps identify content containers that might not use semantic HTML tags but follow common naming conventions.
Parameters:
- body: The body element to search within
- potentialNodes: A pointer to a slice where identified elements will be added
func AnalyzeUrlPattern ¶
AnalyzeUrlPattern analyzes the pattern of the URL's last part. This is a helper function for debugging and understanding URL patterns. It categorizes the last part of a URL into patterns like "numeric only", "alphanumeric", etc.
Parameters:
- url: The URL to analyze
Returns:
- A string describing the pattern of the URL's last part
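As a rough illustration of this kind of categorization, here is a minimal sketch. The function name `lastSegmentPattern` and the labels "empty", "invalid", "alphabetic", and "mixed" are assumptions made for this example; only "numeric only" and "alphanumeric" come from the description above, and the package's actual rules may differ.

```go
package main

import (
	"net/url"
	"strings"
	"unicode"
)

// lastSegmentPattern is a hypothetical stand-in for AnalyzeUrlPattern.
// It labels the last path segment of rawURL with a coarse pattern name.
func lastSegmentPattern(rawURL string) string {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "invalid"
	}
	// Drop leading and trailing slashes, then keep the final segment.
	segs := strings.Split(strings.Trim(u.Path, "/"), "/")
	last := segs[len(segs)-1]
	if last == "" {
		return "empty"
	}
	hasDigit, hasLetter, hasOther := false, false, false
	for _, r := range last {
		switch {
		case unicode.IsDigit(r):
			hasDigit = true
		case unicode.IsLetter(r):
			hasLetter = true
		default:
			hasOther = true
		}
	}
	switch {
	case hasOther:
		return "mixed"
	case hasDigit && !hasLetter:
		return "numeric only"
	case hasDigit && hasLetter:
		return "alphanumeric"
	default:
		return "alphabetic"
	}
}
```

Under these assumed rules, a path ending in `/posts/12345` is labeled "numeric only", while a slug such as `/my-post` falls into the catch-all "mixed" bucket because of the hyphen.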
func AriaTreeToString ¶
AriaTreeToString converts an AriaTree to a string representation. This is useful for debugging and visualizing the accessibility structure of a document.
Parameters:
- tree: The AriaTree to convert to a string
Returns:
- A string representation of the tree
func CountAriaNodes ¶
CountAriaNodes counts the total number of nodes in an AriaNode tree. This includes the node itself and all its descendants.
Parameters:
- node: The root node to count from
Returns:
- The total number of nodes in the tree
func CountNodes ¶
CountNodes counts the number of nodes within a VElement. This includes the element itself and all its descendants (both elements and text nodes).
Parameters:
- element: The element to count nodes for
Returns:
- The total number of nodes
func CreateElement ¶
CreateElement creates a new element with the given tag name. This is useful for creating new elements to insert into the DOM.
Parameters:
- tagName: The tag name for the new element
Returns:
- A new VElement with the specified tag name
func CreateExtractor ¶
func CreateExtractor(options ReadabilityOptions) func(string) (ReadabilityArticle, error)
CreateExtractor creates a custom extractor function with specific options. This is useful when you want to reuse the same extraction configuration multiple times. The returned function can be called with HTML strings to extract content using the predefined options.
Parameters:
- options: The readability options to use for all extractions
Returns:
- A function that takes an HTML string and returns a ReadabilityArticle and error
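The reuse pattern described here is a closure over the options value. The sketch below shows the shape of that pattern with local stand-in types; `Options`, `Article`, `extract`, and `newExtractor` are hypothetical names for this example and are not the package's own definitions.

```go
package main

import "strings"

// Options and Article are hypothetical stand-ins for
// ReadabilityOptions and ReadabilityArticle.
type Options struct {
	CharThreshold int
}

type Article struct {
	Text string
	OK   bool // true when Text meets the character threshold
}

// extract is a placeholder for the real extraction step.
func extract(html string, opts Options) (Article, error) {
	text := strings.TrimSpace(html)
	return Article{Text: text, OK: len(text) >= opts.CharThreshold}, nil
}

// newExtractor captures opts once and returns a function that
// applies the same configuration to every input it receives.
func newExtractor(opts Options) func(string) (Article, error) {
	return func(html string) (Article, error) {
		return extract(html, opts)
	}
}
```

A caller would build the extractor once, for example `ex := newExtractor(Options{CharThreshold: 5})`, and then call `ex(page)` for each document without passing the options again.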
func CreateTextNode ¶
CreateTextNode creates a new text node with the given content. This is useful for creating text nodes to insert into the DOM.
Parameters:
- content: The text content for the new node
Returns:
- A new VText node with the specified content
func ExtractTextContent ¶
ExtractTextContent extracts the text content from a VElement. It returns only the content of the text nodes, without any HTML formatting.
Parameters:
- element: The element to extract text from
Returns:
- A string containing all text content from the element and its descendants
func FindMainCandidates ¶
FindMainCandidates detects nodes that are likely to be the main content candidates, sorted by score. It implements the core scoring algorithm of readability, analyzing elements based on content length, tag types, class names, and other heuristics to identify the most likely content containers.
Parameters:
- doc: The parsed HTML document
- nbTopCandidates: The number of top candidates to return
Returns:
- A slice of the top N candidate elements, sorted by score in descending order
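The final "sort by score, keep the top N" step can be sketched independently of the scoring itself. The `candidate` type and `topCandidates` function below are hypothetical and exist only for this example; the scores are assumed to have been computed already.

```go
package main

import "sort"

// candidate is a hypothetical pairing of an element tag with its score.
type candidate struct {
	Tag   string
	Score float64
}

// topCandidates returns up to n candidates ordered by descending score.
// The input slice is copied so the caller's ordering is left untouched.
func topCandidates(cs []candidate, n int) []candidate {
	sorted := append([]candidate(nil), cs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Score > sorted[j].Score
	})
	if n < 0 {
		n = 0
	}
	if n > len(sorted) {
		n = len(sorted)
	}
	return sorted[:n]
}
```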
func FindStructuralElements ¶
func FindStructuralElements(doc *dom.VDocument) (
    header *dom.VElement,
    footer *dom.VElement,
    otherSignificantNodes []*dom.VElement,
)
FindStructuralElements detects header, footer, and other significant structural elements in a document. This is particularly useful for pages that are classified as articles but where the main content extraction fails to meet the threshold. It uses semantic tags, ARIA roles, and common class/ID patterns to identify important page structures.
Parameters:
- doc: The parsed HTML document
Returns:
- header: The identified page header element, if found
- footer: The identified page footer element, if found
- otherSignificantNodes: Other semantically significant elements found in the document
func FormatDocument ¶
FormatDocument formats the entire document text. It merges consecutive line breaks into one and removes extra line breaks at the beginning and end. This produces a cleaner, more readable text output.
Parameters:
- text: The text to format
Returns:
- The formatted text
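The two steps described above (merge runs of line breaks, then trim the ends) can be sketched with the standard library. The name `formatText` is hypothetical, and the sketch assumes `\n` line endings; how the package treats `\r\n` or whitespace-only lines is not stated in the source.

```go
package main

import (
	"regexp"
	"strings"
)

// multiNewline matches two or more consecutive line breaks.
var multiNewline = regexp.MustCompile(`\n{2,}`)

// formatText collapses runs of line breaks into a single one and
// strips line breaks from the start and end of the text.
func formatText(text string) string {
	merged := multiNewline.ReplaceAllString(text, "\n")
	return strings.Trim(merged, "\n")
}
```

With this sketch, the input `"\n\nfirst\n\n\nsecond\n"` becomes `"first\nsecond"`.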
func GetAccessibleName ¶
GetAccessibleName returns the accessible name of an element. It follows the accessible name calculation algorithm, prioritizing aria-label, aria-labelledby, alt, title, and text content. The accessible name is what would be announced by screen readers and other assistive technologies.
Parameters:
- element: The element to get the accessible name for
Returns:
- The accessible name as a string
func GetAriaRole ¶
GetAriaRole returns the ARIA role of an element. It returns the explicit role attribute or an implicit role based on the tag name. ARIA roles provide semantic meaning to elements for accessibility purposes.
Parameters:
- element: The element to get the role for
Returns:
- The ARIA role as a string
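The "explicit role first, implicit role as fallback" lookup can be sketched as follows. The `element` type, the `ariaRole` function, and the small role table are assumptions for this example; the table covers only a handful of well-known implicit roles, and returning an empty string for unmapped tags is a choice made here, not confirmed behavior of the package.

```go
package main

import "strings"

// element is a hypothetical minimal element with a tag and attributes.
type element struct {
	Tag   string
	Attrs map[string]string
}

// implicitRoles maps a few tag names to their implicit ARIA roles.
var implicitRoles = map[string]string{
	"main":    "main",
	"nav":     "navigation",
	"article": "article",
	"aside":   "complementary",
	"button":  "button",
	"ul":      "list",
	"ol":      "list",
	"li":      "listitem",
	"h1":      "heading",
	"h2":      "heading",
	"h3":      "heading",
	"table":   "table",
}

// ariaRole prefers an explicit role attribute and otherwise falls
// back to the implicit role of the tag ("" when none is mapped).
func ariaRole(e *element) string {
	if role := strings.TrimSpace(e.Attrs["role"]); role != "" {
		return strings.ToLower(role)
	}
	return implicitRoles[strings.ToLower(e.Tag)]
}
```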
func GetArticleByline ¶
GetArticleByline extracts the author information from the document. It uses various strategies including meta tags and JSON-LD data to find the author or byline information associated with the content.
Parameters:
- doc: The parsed HTML document
Returns:
- The extracted author/byline information as a string
func GetArticleTitle ¶
GetArticleTitle extracts the article title from the document. It tries various strategies to find the most appropriate title, including examining the <title> element, heading elements, and handling common title patterns like site name separators.
Parameters:
- doc: The parsed HTML document
Returns:
- The extracted article title as a string
func GetAttribute ¶
GetAttribute gets the value of an attribute on an element. Returns an empty string if the attribute doesn't exist.
Parameters:
- element: The element to get the attribute from
- name: The name of the attribute to get
Returns:
- The attribute value, or an empty string if not found
func GetClassWeight ¶
GetClassWeight calculates a score adjustment based on the class name and ID of an element. It returns a positive score for elements likely to contain content and a negative score for elements likely to be noise. This helps the algorithm prioritize content-rich elements and deprioritize elements that typically contain non-content material.
Parameters:
- node: The element to calculate a class weight for
Returns:
- A float64 score adjustment (positive for likely content, negative for likely noise)
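A minimal sketch of this kind of adjustment is shown below. The pattern lists and the step of 25 per match follow the general approach of Mozilla's Readability.js and are assumptions here; the package's actual patterns and weights are not given in this documentation. The `classWeight` function takes plain strings rather than a `*dom.VElement` to keep the example self-contained.

```go
package main

import "regexp"

var (
	// positiveHint and negativeHint are illustrative patterns only.
	positiveHint = regexp.MustCompile(`(?i)article|body|content|entry|main|post|text|blog|story`)
	negativeHint = regexp.MustCompile(`(?i)comment|footer|sidebar|sponsor|promo|banner|widget|hidden`)
)

// classWeight adds 25 for each of className and id that looks like
// content and subtracts 25 for each that looks like noise.
func classWeight(className, id string) float64 {
	weight := 0.0
	for _, s := range []string{className, id} {
		if s == "" {
			continue
		}
		if negativeHint.MatchString(s) {
			weight -= 25
		}
		if positiveHint.MatchString(s) {
			weight += 25
		}
	}
	return weight
}
```

A name that matches both lists, such as `post-footer`, nets out to zero in this sketch.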
func GetElementsByTagName ¶
GetElementsByTagName returns all elements with the specified tag name in the element tree. If tagName is "*", it returns all elements.
Parameters:
- element: The root element to search from
- tagName: The tag name to search for, or "*" for all elements
Returns:
- A slice of elements matching the tag name
func GetElementsByTagNames ¶
GetElementsByTagNames returns all elements with any of the specified tag names in the element tree. This is useful for finding elements of multiple types in a single pass.
Parameters:
- element: The root element to search from
- tagNames: A slice of tag names to search for
Returns:
- A slice of elements matching any of the tag names
func GetInnerText ¶
GetInnerText returns the inner text of an element or text node. If normalizeSpaces is true, consecutive whitespace is normalized to a single space. This extracts all text content from an element and its descendants.
Parameters:
- node: The node to get text from
- normalizeSpaces: Whether to normalize whitespace
Returns:
- The combined text content of the node and its descendants
func GetLinkDensity ¶
GetLinkDensity calculates the ratio of link text to all text in an element. Returns a value between 0 and 1, where higher values indicate more links. This is useful for identifying navigation areas and other link-heavy sections that are unlikely to be main content.
Parameters:
- element: The element to calculate link density for
Returns:
- A float64 between 0 and 1 representing the link density
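The ratio can be sketched on a small stand-in tree. The `node` type below is hypothetical (an empty `Tag` marks a text node) and is not the package's `dom.VElement`; text is measured in runes, which is an assumption of this example.

```go
package main

// node is a hypothetical minimal DOM node: Tag == "" marks a text node.
type node struct {
	Tag      string
	Text     string
	Children []*node
}

// textLen returns the number of characters of text under n.
func textLen(n *node) int {
	if n.Tag == "" {
		return len([]rune(n.Text))
	}
	total := 0
	for _, c := range n.Children {
		total += textLen(c)
	}
	return total
}

// linkTextLen returns the number of characters of text inside <a> elements under n.
func linkTextLen(n *node) int {
	if n.Tag == "a" {
		return textLen(n)
	}
	total := 0
	for _, c := range n.Children {
		total += linkTextLen(c)
	}
	return total
}

// linkDensity is link text length divided by total text length,
// defined as 0 for elements that contain no text.
func linkDensity(n *node) float64 {
	total := textLen(n)
	if total == 0 {
		return 0
	}
	return float64(linkTextLen(n)) / float64(total)
}
```

For a paragraph holding six characters of plain text and a two-character link, the density is 2/8 = 0.25.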
func GetNodeAncestors ¶
GetNodeAncestors returns the ancestor elements of a node up to a specified depth. If maxDepth is less than or equal to 0, all ancestors are returned. This is useful for traversing up the DOM tree to find parent elements.
Parameters:
- node: The element to get ancestors for
- maxDepth: The maximum number of ancestors to return, or <= 0 for all
Returns:
- A slice of ancestor elements, ordered from closest to furthest
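The traversal and the meaning of `maxDepth` can be sketched with a parent pointer walk. The `elem` type and `ancestors` function are hypothetical stand-ins for this example.

```go
package main

// elem is a hypothetical element that only knows its tag and parent.
type elem struct {
	Tag    string
	Parent *elem
}

// ancestors walks up from n and collects parents, closest first.
// A maxDepth of 0 or less means "no limit".
func ancestors(n *elem, maxDepth int) []*elem {
	var out []*elem
	for p := n.Parent; p != nil; p = p.Parent {
		out = append(out, p)
		if maxDepth > 0 && len(out) == maxDepth {
			break
		}
	}
	return out
}
```

For a `p` nested as `html > body > div > p`, a `maxDepth` of 0 yields `div`, `body`, `html`, while a `maxDepth` of 2 stops after `body`.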
func GetTextDensity ¶
GetTextDensity calculates the ratio of text to child elements in an element. Returns a value where higher values indicate more text-dense content. This helps identify content-rich elements that are likely to be the main content.
Parameters:
- element: The element to calculate text density for
Returns:
- A float64 representing the text density
func HasAncestorTag ¶
HasAncestorTag checks if a node has an ancestor with the specified tag name. If maxDepth is less than or equal to 0, all ancestors are checked. This is useful for determining if an element is contained within a specific type of element.
Parameters:
- node: The node to check ancestors for
- tagName: The tag name to look for in ancestors
- maxDepth: The maximum depth to check, or <= 0 for unlimited
Returns:
- true if an ancestor with the specified tag name is found, false otherwise
func InitializeNode ¶
InitializeNode initializes a node with a readability score. It sets an initial score based on the tag name and adjusts it based on class name and ID. This is a key part of the content scoring algorithm, establishing baseline scores for different HTML elements.
Parameters:
- node: The element to initialize with a readability score
func IsProbablyContent ¶
IsProbablyContent reports whether an element is likely to contain meaningful content. It is a simplified check similar to isProbablyReaderable and examines the element's visibility, class/ID patterns, text length, and link density.
Parameters:
- element: The element to evaluate
Returns:
- true if the element is likely to contain meaningful content, false otherwise
func IsProbablyVisible ¶
IsProbablyVisible checks if an element is likely to be visible based on its attributes. This helps filter out hidden elements that shouldn't be included in the extracted content.
Parameters:
- node: The element to check
Returns:
- true if the element is likely visible, false otherwise
func IsSemanticTag ¶
IsSemanticTag checks if an element is a semantic tag or contains semantic tags. Semantic tags include main, article, and elements with content-related classes/IDs. These tags provide structural meaning to the content and are strong indicators of meaningful content areas.
Parameters:
- element: The element to check
Returns:
- true if the element is or contains semantic tags, false otherwise
func IsSignificantNode ¶
IsSignificantNode determines if a node is semantically significant. This includes elements like header, footer, main, article, etc. Significant nodes are important structural elements that help understand the page's organization even when the main content extraction fails.
Parameters:
- node: The element to check
Returns:
- true if the node is semantically significant, false otherwise
func IsURL ¶
IsURL checks if a string is a valid URL. This is a simple validation function that checks if a string starts with http:// or https:// to determine if it's likely a URL.
Parameters:
- str: The string to check
Returns:
- true if the string appears to be a URL, false otherwise
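The prefix check described above amounts to very little code. The sketch below uses the hypothetical name `looksLikeURL` and assumes a case-sensitive comparison, which the source does not specify.

```go
package main

import "strings"

// looksLikeURL reports whether s starts with an http or https scheme.
// It does not validate the rest of the string.
func looksLikeURL(s string) bool {
	return strings.HasPrefix(s, "http://") || strings.HasPrefix(s, "https://")
}
```

Because only the prefix is inspected, a string such as `https://` with no host also passes; callers that need strict validation would likely want `net/url` on top of this.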
func ParseHTML ¶
ParseHTML parses an HTML string and returns a virtual DOM document. It uses golang.org/x/net/html for parsing and converts the result to our internal DOM structure. The baseURI parameter is used to resolve relative URLs in the document.
Parameters:
- htmlContent: The HTML string to parse
- baseURI: The base URI for resolving relative URLs (can be empty)
Returns:
- A pointer to a VDocument representing the parsed HTML
- An error if parsing fails
func PreprocessDocument ¶
PreprocessDocument removes noise elements from the document. This includes removing semantic tags, unnecessary tags, and ad elements. Preprocessing is an important step to clean up the document before content extraction.
Parameters:
- doc: The parsed HTML document to preprocess
Returns:
- The same document after preprocessing (for method chaining)
func SerializeDocumentToHTML ¶
SerializeDocumentToHTML converts a virtual DOM document to an HTML string. This serializes an entire document, including the doctype and HTML structure.
Parameters:
- doc: The VDocument to serialize
Returns:
- An HTML string representation of the document
func SerializeDocumentToWriter ¶
SerializeDocumentToWriter writes the HTML representation of a document to a writer. This serializes an entire document to a writer, which is useful for streaming HTML output to a file or response writer.
Parameters:
- doc: The VDocument to serialize
- w: The io.Writer to write to
Returns:
- An error if writing fails
func SerializeToHTML ¶
SerializeToHTML converts a virtual DOM element to an HTML string. This is useful for converting a VNode back to an HTML string after processing.
Parameters:
- node: The VNode to serialize
Returns:
- An HTML string representation of the node
func SerializeToWriter ¶
SerializeToWriter writes the HTML representation of a node to a writer. This is useful for streaming HTML output to a file or response writer.
Parameters:
- node: The VNode to serialize
- w: The io.Writer to write to
Returns:
- An error if writing fails
func Stringify ¶
Stringify converts a VElement to a readable string format. It removes tags while inserting line breaks according to block and inline elements, aligns all text to the shallowest indent, and merges consecutive line breaks into one.
Parameters:
- element: The element to convert to a string
Returns:
- A plain text representation of the element's content
func TextSimilarity ¶
TextSimilarity compares two texts and returns a similarity score between 0 and 1. 1 means identical texts, 0 means completely different texts. This is used to compare potential titles and other text elements to find the best match.
Parameters:
- textA: The first text to compare
- textB: The second text to compare
Returns:
- A float64 similarity score between 0 and 1
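One way to obtain a score with these properties is the token-based measure used by Mozilla's Readability.js, sketched below. Whether this package uses the same formula is an assumption; the names `tokenize` and `similarity` are hypothetical. Note that this measure is asymmetric: it scores how much of `b` is covered by tokens from `a`.

```go
package main

import (
	"regexp"
	"strings"
)

// nonWord splits text on runs of non-word characters.
var nonWord = regexp.MustCompile(`\W+`)

// tokenize lowercases s and returns its non-empty word tokens.
func tokenize(s string) []string {
	var out []string
	for _, t := range nonWord.Split(strings.ToLower(s), -1) {
		if t != "" {
			out = append(out, t)
		}
	}
	return out
}

// similarity returns 1 minus the share of b (by joined length)
// made up of tokens that do not occur in a.
func similarity(a, b string) float64 {
	ta, tb := tokenize(a), tokenize(b)
	if len(ta) == 0 || len(tb) == 0 {
		return 0
	}
	inA := make(map[string]bool, len(ta))
	for _, t := range ta {
		inA[t] = true
	}
	var uniqB []string
	for _, t := range tb {
		if !inA[t] {
			uniqB = append(uniqB, t)
		}
	}
	distance := float64(len(strings.Join(uniqB, " "))) / float64(len(strings.Join(tb, " ")))
	return 1 - distance
}
```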
func ToHTML ¶
ToHTML generates an HTML string from a VElement, omitting span tags and class attributes. This produces a cleaner HTML representation of the extracted content by removing unnecessary styling and presentation elements.
Parameters:
- element: The element to convert to HTML
Returns:
- A string containing the HTML representation of the element
func ToMarkdown ¶
ToMarkdown converts a VElement to a Markdown string. This is the main entry point for HTML to Markdown conversion, which produces a well-formatted Markdown document from an HTML element.
Parameters:
- element: The HTML element to convert to Markdown
Returns:
- A Markdown string representation of the element
func UnescapeHTMLEntities ¶
UnescapeHTMLEntities converts HTML entities to their corresponding characters. This handles both named entities like &amp;amp; and numeric entities like &amp;#39;.
Parameters:
- str: The string containing HTML entities to unescape
Returns:
- The unescaped string with entities converted to their character equivalents
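The Go standard library offers the same kind of conversion through `html.UnescapeString`, which the sketch below wraps. Whether this package delegates to it internally is not stated; the wrapper name `unescape` is hypothetical.

```go
package main

import "html"

// unescape converts named and numeric HTML entities in s to the
// characters they represent, using the standard library.
func unescape(s string) string {
	return html.UnescapeString(s)
}
```

For example, `unescape("Tom &amp; Jerry&#39;s")` yields `Tom & Jerry's`.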
Types ¶
type AriaNode ¶
type AriaNode struct {
    Type            AriaNodeType  // Type of the ARIA node
    Name            string        // Accessible name
    Role            string        // Explicit ARIA role
    Level           int           // Heading level, etc.
    Checked         *bool         // Checkbox state (pointer to allow nil for "not applicable")
    Selected        *bool         // Selection state
    Expanded        *bool         // Expansion state
    Disabled        *bool         // Disabled state
    Required        *bool         // Required state
    ValueMin        *float64      // Minimum value
    ValueMax        *float64      // Maximum value
    ValueText       string        // Text representation of value
    Children        []*AriaNode   // Child nodes
    OriginalElement *dom.VElement // Reference to the original DOM element
}
AriaNode represents a node in an accessibility tree. It contains information about the accessibility properties of an element, such as its role, name, state, and children, which is useful for understanding the semantic structure of a document from an accessibility perspective.
func BuildAriaNode ¶
BuildAriaNode builds an AriaNode from a DOM element. This recursively constructs an accessibility tree node from a DOM element, including its properties and children.
Parameters:
- element: The DOM element to build an AriaNode from
Returns:
- An AriaNode representing the element and its children
func CompressAriaTree ¶
CompressAriaTree compresses an AriaTree by removing insignificant nodes, merging similar nodes, and simplifying the structure. This produces a more concise and meaningful representation of the document's accessibility structure.
Parameters:
- node: The root node of the tree to compress
Returns:
- The compressed tree's root node
type AriaNodeType ¶
type AriaNodeType string
AriaNodeType represents the type of an ARIA node.
const (
    // ARIA landmark roles
    AriaNodeTypeBanner        AriaNodeType = "banner"
    AriaNodeTypeComplementary AriaNodeType = "complementary"
    AriaNodeTypeContentInfo   AriaNodeType = "contentinfo"
    AriaNodeTypeForm          AriaNodeType = "form"
    AriaNodeTypeMain          AriaNodeType = "main"
    AriaNodeTypeRegion        AriaNodeType = "region"
    AriaNodeTypeSearch        AriaNodeType = "search"

    // ARIA widget roles
    AriaNodeTypeArticle      AriaNodeType = "article"
    AriaNodeTypeButton       AriaNodeType = "button"
    AriaNodeTypeCell         AriaNodeType = "cell"
    AriaNodeTypeCheckbox     AriaNodeType = "checkbox"
    AriaNodeTypeColumnHeader AriaNodeType = "columnheader"
    AriaNodeTypeCombobox     AriaNodeType = "combobox"
    AriaNodeTypeDialog       AriaNodeType = "dialog"
    AriaNodeTypeFigure       AriaNodeType = "figure"
    AriaNodeTypeGrid         AriaNodeType = "grid"
    AriaNodeTypeGridCell     AriaNodeType = "gridcell"
    AriaNodeTypeHeading      AriaNodeType = "heading"
    AriaNodeTypeImg          AriaNodeType = "img"
    AriaNodeTypeLink         AriaNodeType = "link"
    AriaNodeTypeList         AriaNodeType = "list"
    AriaNodeTypeListItem     AriaNodeType = "listitem"
    AriaNodeTypeMenuItem     AriaNodeType = "menuitem"
    AriaNodeTypeOption       AriaNodeType = "option"
    AriaNodeTypeProgressBar  AriaNodeType = "progressbar"
    AriaNodeTypeRadio        AriaNodeType = "radio"
    AriaNodeTypeRadioGroup   AriaNodeType = "radiogroup"
    AriaNodeTypeRow          AriaNodeType = "row"
    AriaNodeTypeRowGroup     AriaNodeType = "rowgroup"
    AriaNodeTypeRowHeader    AriaNodeType = "rowheader"
    AriaNodeTypeSearchBox    AriaNodeType = "searchbox"
    AriaNodeTypeSeparator    AriaNodeType = "separator"
    AriaNodeTypeSlider       AriaNodeType = "slider"
    AriaNodeTypeSpinButton   AriaNodeType = "spinbutton"
    AriaNodeTypeSwitch       AriaNodeType = "switch"
    AriaNodeTypeTab          AriaNodeType = "tab"
    AriaNodeTypeTable        AriaNodeType = "table"
    AriaNodeTypeTabList      AriaNodeType = "tablist"
    AriaNodeTypeTabPanel     AriaNodeType = "tabpanel"
    AriaNodeTypeTextBox      AriaNodeType = "textbox"
    AriaNodeTypeText         AriaNodeType = "text"
    AriaNodeTypeGeneric      AriaNodeType = "generic" // Any other role
)
ARIA node types
func GetAriaNodeType ¶
func GetAriaNodeType(element *dom.VElement) AriaNodeType
GetAriaNodeType determines the AriaNodeType of an element based on its role. This maps ARIA roles to their corresponding AriaNodeType enum values.
Parameters:
- element: The element to determine the node type for
Returns:
- The AriaNodeType corresponding to the element's role
type AriaTree ¶
type AriaTree struct {
    Root      *AriaNode // Root node of the ARIA tree
    NodeCount int       // Total number of nodes in the tree
}
AriaTree represents an accessibility tree. This is a hierarchical representation of a document's accessibility structure, which can be used as a fallback when traditional content extraction fails.
func BuildAriaTree ¶
BuildAriaTree builds an AriaTree from a DOM document. This constructs a complete accessibility tree from a document, then compresses it to produce a more concise and meaningful representation.
Parameters:
- doc: The DOM document to build an AriaTree from
Returns:
- An AriaTree representing the document's accessibility structure
type ArticleContent ¶
type ArticleContent struct {
    Title  string        // Extracted title
    Byline string        // Extracted byline/author
    Root   *dom.VElement // Main content root element
}
ArticleContent represents the content of an article page. This is a simplified view of ReadabilityArticle focused on article-specific content.
type OtherContent ¶
type OtherContent struct {
    Title                 string          // Extracted title
    Header                *dom.VElement   // Page header, if identified
    OtherSignificantNodes []*dom.VElement // Other semantically significant nodes
    AriaTree              *AriaTree       // ARIA tree representation
}
OtherContent represents the content of a non-article page. This is used for pages that don't fit the article pattern, such as index pages, landing pages, or other non-article content.
type PageType ¶
type PageType string
PageType represents the type of a page (article, other, etc.). It is used to classify pages based on their content structure and characteristics.
func ClassifyPageType ¶
func ClassifyPageType(
    doc *dom.VDocument,
    candidates []*dom.VElement,
    charThreshold int,
    url string,
) PageType
ClassifyPageType classifies a document as an article or other type of page. It uses various heuristics including URL pattern, semantic tags, text length, link density, and more to determine the page type. This classification helps the extraction process decide how to handle different types of content.
Parameters:
- doc: The parsed HTML document
- candidates: The list of content candidates found by the scoring algorithm
- charThreshold: The minimum character threshold for article content
- url: The URL of the page (optional, used for URL pattern analysis)
Returns:
- PageType: Either PageTypeArticle or PageTypeOther
func GetExpectedPageTypeByUrl ¶
GetExpectedPageTypeByUrl determines the expected page type based on URL patterns. This is a helper function that can be used before full page analysis to get a preliminary classification based solely on URL patterns.
Parameters:
- url: The URL of the page to analyze
Returns:
- PageType: Either PageTypeArticle or PageTypeOther based on URL patterns
type ReadabilityArticle ¶
type ReadabilityArticle struct {
    Title     string        // Extracted title
    Byline    string        // Extracted byline/author information
    Root      *dom.VElement // Main content root element (if score threshold is met)
    NodeCount int           // Total number of nodes
    PageType  PageType      // Classification of page type

    // Structural elements (set when PageType is ARTICLE but Root is nil)
    Header                *dom.VElement   // Page header element, if identified
    OtherSignificantNodes []*dom.VElement // Other semantically significant nodes

    // Fallback when article extraction fails
    AriaTree *AriaTree // ARIA tree representation
}
ReadabilityArticle represents the result of a readability extraction. It contains the extracted content, metadata, and structural information about the page.
func Extract ¶
func Extract(html string, options ReadabilityOptions) (ReadabilityArticle, error)
Extract extracts the article content from HTML. This is the main entry point for the readability extraction process. It parses the HTML, preprocesses the document, and extracts the main content based on the provided options.
Parameters:
- html: The HTML string to extract content from
- options: Configuration options for the extraction process
Returns:
- A ReadabilityArticle containing the extracted content and metadata
- An error if the HTML parsing fails
func ExtractContent ¶
func ExtractContent(doc *dom.VDocument, options ReadabilityOptions) ReadabilityArticle
ExtractContent extracts the main content from a document. This is the core function for content extraction that implements the main readability algorithm to identify and extract the primary content.
Parameters:
- doc: The parsed HTML document as a VDocument
- options: Configuration options for the extraction process
Returns:
- A ReadabilityArticle containing the extracted content and metadata
func (*ReadabilityArticle) GetContentByPageType ¶
func (r *ReadabilityArticle) GetContentByPageType() interface{}
GetContentByPageType returns the appropriate content structure based on page type. It returns either ArticleContent or OtherContent depending on the page type. This allows consumers to handle different page types with type-specific structures.
Returns:
- ArticleContent if the page is classified as an article
- OtherContent if the page is classified as any other type
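Because the method returns `interface{}`, callers typically recover the concrete type with a type switch. The sketch below uses local stand-in types (`articleContent`, `otherContent`) and a hypothetical `describe` function; it handles both value and pointer forms because this documentation does not say which one the method returns.

```go
package main

// articleContent and otherContent are hypothetical stand-ins for
// the package's ArticleContent and OtherContent types.
type articleContent struct {
	Title  string
	Byline string
}

type otherContent struct {
	Title string
}

// describe inspects a value of unknown type and reports which
// content structure it holds.
func describe(content interface{}) string {
	switch c := content.(type) {
	case articleContent:
		return "article: " + c.Title
	case *articleContent:
		return "article: " + c.Title
	case otherContent:
		return "other: " + c.Title
	case *otherContent:
		return "other: " + c.Title
	default:
		return "unknown"
	}
}
```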
type ReadabilityMetadata ¶
type ReadabilityMetadata struct {
    Title         string
    Byline        string
    Excerpt       string
    SiteName      string
    PublishedTime string
}
ReadabilityMetadata represents metadata extracted from a document. It contains information like title, author, excerpt, site name, and publication date that helps identify and contextualize the content.
func GetJSONLD ¶
func GetJSONLD(doc *dom.VDocument) ReadabilityMetadata
GetJSONLD extracts metadata from JSON-LD objects in the document. It currently only supports Schema.org objects of type Article or its subtypes. JSON-LD is a structured data format that provides rich metadata about web content.
Parameters:
- doc: The parsed HTML document
Returns:
- ReadabilityMetadata containing information extracted from JSON-LD
type ReadabilityOptions ¶
type ReadabilityOptions struct {
    // CharThreshold is the minimum number of characters an article must have
    CharThreshold int
    // NbTopCandidates is the number of top candidates to consider
    NbTopCandidates int
    // GenerateAriaTree indicates whether to generate ARIA tree representation
    GenerateAriaTree bool
    // ForcedPageType allows forcing a specific page type classification
    ForcedPageType PageType
}
ReadabilityOptions contains configuration options for the readability extraction process. These options control various aspects of the content extraction algorithm, such as thresholds, candidate selection, and output format.
func DefaultOptions ¶
func DefaultOptions() ReadabilityOptions
DefaultOptions returns a ReadabilityOptions struct with default values. This provides a convenient way to get a pre-configured options object with reasonable defaults for most extraction scenarios.
Returns:
- A ReadabilityOptions struct initialized with default values
Source Files ¶
Directories ¶
Path | Synopsis |
---|---|
cmd | |
internal | |
dom | Package dom provides virtual DOM structures and operations for HTML parsing and manipulation. |
parser | Package parser provides HTML parsing functionality for the readability library. |