The html library provides functions to read conformant HTML4 documents and structures to represent them. Because html assumes that documents are conformant and is restricted to the older specification, it should be viewed as a legacy library. We suggest using the html-parsing package for modern Web scraping.
(read-xhtml port) → html?
  port : input-port?
(read-html port) → html?
  port : input-port?
Reads (X)HTML from a port, producing an html instance.
(read-html-as-xml port) → (listof content/c)
  port : input-port?
Reads HTML from a port, producing a list of XML content. Each item can be turned into an X-expression, if necessary, with xml->xexpr.
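The conversion to X-expressions described above can be sketched as follows. This is a minimal illustration, not a canonical usage: the input fragment and the names contents and xexprs are made up for the example, and the exact shape of the printed X-expressions depends on how the reader normalizes the input.

```racket
#lang racket
;; Prefix both libraries, since html and xml export conflicting names.
(require (prefix-in h: html)
         (prefix-in x: xml))

;; Read an HTML fragment as generic XML content.
;; The fragment is an illustrative assumption.
(define contents
  (h:read-html-as-xml
   (open-input-string "<p>Hello <b>world</b>!</p>")))

;; Convert each piece of content to an X-expression.
(define xexprs (map x:xml->xexpr contents))

;; Print one X-expression per line.
(for ([xe (in-list xexprs)])
  (printf "~s\n" xe))
```

Working with X-expressions is often more convenient than working with the xml structures directly, because they are ordinary lists, symbols, and strings.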
(read-html-comments) → boolean?
(read-html-comments v) → void?
  v : any/c
A parameter. If v is not #f, then comments are read and returned. Defaults to #f.
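As a rough sketch of how this setting might be used, the following counts the comments that survive reading with the setting off and on. It assumes read-html-comments is a parameter that can be set with parameterize and that kept comments appear as xml comment structures. The helpers read-content, count-comments, and total-comments are hypothetical names introduced only for this example.

```racket
#lang racket
(require (prefix-in h: html)
         (prefix-in x: xml))

;; Illustrative input containing a single comment.
(define source "<p>Hello<!-- a note --> world</p>")

;; read-content: boolean -> (listof content)
;; Reads source with comment retention switched on or off.
(define (read-content keep-comments?)
  (parameterize ([h:read-html-comments keep-comments?])
    (h:read-html-as-xml (open-input-string source))))

;; count-comments: content -> natural
;; Counts comment structures in c, descending into elements.
(define (count-comments c)
  (cond [(x:comment? c) 1]
        [(x:element? c)
         (for/sum ([child (in-list (x:element-content c))])
           (count-comments child))]
        [else 0]))

;; total-comments: boolean -> natural
(define (total-comments keep-comments?)
  (for/sum ([c (in-list (read-content keep-comments?))])
    (count-comments c)))

(printf "default: ~a\n" (total-comments #f))
(printf "enabled: ~a\n" (total-comments #t))
```

Using parameterize keeps the change local to one read, so other code that reads HTML still sees the default.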
(use-html-spec) → boolean?
(use-html-spec v) → void?
  v : any/c
A parameter. If v is not #f, then the HTML must respect the HTML specification with regard to which elements are allowed to be the children of other elements. For example, the top-level "<html>" element may contain only a "<head>" and a "<body>" element. Defaults to #t.
(module html-example racket

  ; Some of the symbols in html and xml conflict with
  ; each other and with racket/base language, so we prefix
  ; to avoid namespace conflict.
  (require (prefix-in h: html)
           (prefix-in x: xml))

  (define an-html
    (h:read-xhtml
     (open-input-string
      (string-append
       "<html><head><title>My title</title></head><body>"
       "<p>Hello world</p><p><b>Testing</b>!</p>"
       "</body></html>"))))

  ; extract-pcdata: html-content/c -> (listof string)
  ; Pulls out the pcdata strings from some-content.
  (define (extract-pcdata some-content)
    (cond [(x:pcdata? some-content)
           (list (x:pcdata-string some-content))]
          [(x:entity? some-content)
           (list)]
          [else
           (extract-pcdata-from-element some-content)]))

  ; extract-pcdata-from-element: html-element -> (listof string)
  ; Pulls out the pcdata strings from an-html-element.
  (define (extract-pcdata-from-element an-html-element)
    (match an-html-element
      [(struct h:html-full (attributes content))
       (apply append (map extract-pcdata content))]
      [(struct h:html-element (attributes))
       '()]))

  (printf "~s\n" (extract-pcdata an-html)))
> (require 'html-example)
("My title" "Hello world" "Testing" "!")
An html-content/c is either an html-element, a pcdata, or an entity.
(struct html-element (attributes)
    #:extra-constructor-name make-html-element)
  attributes : (listof attribute)
Any of the structures below inherits from html-element.
(struct html-full struct:html-element (content)
    #:extra-constructor-name make-html-full)
  content : (listof html-content/c)
Any html tag that may include content also inherits from html-full without adding any additional fields.
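The inheritance relationship can be observed with the struct predicates and accessors. The sketch below assumes that the value returned by read-xhtml is an instance of a structure inheriting from html-full, as the pattern match in the earlier example suggests. The name doc and the input string are illustrative only.

```racket
#lang racket
;; Prefix to avoid clashes between html's exports and racket's.
(require (prefix-in h: html))

;; A small document used only for illustration.
(define doc
  (h:read-xhtml
   (open-input-string
    (string-append
     "<html><head><title>T</title></head>"
     "<body><p>Hi</p></body></html>"))))

;; The top-level value is both an html-element and an html-full,
;; so the accessors of both structures apply to it.
(printf "html-element? ~a\n" (h:html-element? doc))
(printf "html-full?    ~a\n" (h:html-full? doc))
(printf "attributes:   ~s\n" (h:html-element-attributes doc))
(printf "child count:  ~a\n" (length (h:html-full-content doc)))
```

Because every tag structure shares these two base structures, generic traversals can be written against html-element and html-full alone, without a case for each tag.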
A mzscheme is a special legacy value for the old documentation system.
A Contents-of-html is either
A Contents-of-head is either
A Contents-of-tr is either
A Contents-of-table is either
A Contents-of-fieldset is either
A Contents-of-select is either
A Contents-of-dl is either
A Contents-of-pre is either
A Contents-of-object-applet is either
A Contents-of-map is either
A Contents-of-a is either
A Contents-of-address is either
A Contents-of-body is either
A G12 is either
A G11 is either
A G10 is either
A G9 is either
A G8 is either
A G7 is either
A G6 is either
A G5 is either
A G4 is either
A G3 is either
A G2 is either