Parse an HTML file and extract data from a nested category hierarchy using XSLT 3
Given the following HTML files:
http://bpeck.com/references/DDC/ddc_mine900.htm
http://bpeck.com/references/DDC/ddc_mine200.htm
http://bpeck.com/references/DDC/ddc_mine500.htm
etc.,
how could I get output that shows the hierarchy of the categories?
/---------------------
| ID | Name
| 1  | Main Category
| 3  |   Sub Category
| 5  |     Sub-Sub Category
| 4  |   Sub Category
| 2  | Next Main Category
----------------------
Ideally the output would be in JSON format, but I guess XML would do.
I struggled with a serial parser (SAX) but failed, so I am looking for an elegant solution.
Main categories:
900 World History
910 Geography and travel [see area subdivisions]
920 Biography, genealogy, insignia
930 History of the ancient world
940 General history of Europe [check schedules for date subdivisions]
950 General history of Asia, Far East
etc...
Subcategories of 900:
900 Geography & history
901 Philosophy & theory
902 Miscellany
903 Dictionaries & encyclopedias
904 Collected accounts of events
905 Serial publications
906 Organizations & management
907 Education, research, related topics
908 With respect to kinds of persons
...
Example of sub-sub-categories found under 909 World history:
909.7 18th century, 1700-1799
909.8 1800-
909.82 1900-
For the output, I would prefer whichever approach you judge to be best.
Each key would be the ID (900, 901, 902, etc.) and the corresponding value would be the name (Geography & history, Philosophy & theory, Miscellany). The output JSON should be nested, showing the hierarchy of the categories.
I use Saxon HE version 9.8.
xslt-3.0
The result you have shown doesn't contain any sample data, nor is it XML or JSON. So where do you get the ID values from, and what do you consider a main category, a sub category, and so on?
– Martin Honnen
Nov 22 at 15:59
And which XSLT 3 processor do you use, given that the input is not XML but HTML? Do you use Saxon PE or EE, where you have saxonica.com/html/documentation/functions/saxon/parse-html.html, or some other way to plug an HTML parser into the tool chain?
– Martin Honnen
Nov 22 at 16:25
I use Saxon HE 9.8.
– saloda
Nov 22 at 16:29
It depends: do you use the Java version of Saxon HE? How do you run it, from the command line or from Java code? Can you install TagSoup (vrici.lojban.org/~cowan/XML/tagsoup)? Saxon (even HE) has a command line option -x to name a parser, so if TagSoup is on the class path then -x:org.ccil.cowan.tagsoup.Parser is supposed to use that parser for the input instead of a normal XML parser; you then have the (X)HTML tree TagSoup gives as the input.
– Martin Honnen
Nov 22 at 16:39
edited Nov 22 at 16:28
asked Nov 22 at 15:02
saloda
1 Answer
accepted
The data you have seems poorly structured (I only checked http://bpeck.com/references/DDC/ddc_mine900.htm, but that doesn't pass HTML validation at https://validator.w3.org/check?uri=http%3A%2F%2Fbpeck.com%2Freferences%2FDDC%2Fddc_mine900.htm&charset=%28detect+automatically%29&doctype=Inline&group=0; in particular, the lists of subcategories are not properly nested, so some XSLT plumbing is needed).
As for parsing HTML with XSLT 2 or 3: if you can't set up Saxon to use an HTML parser like TagSoup instead of an XML parser for the input, you can try David Carlisle's htmlparse function, implemented in pure XSLT 2. It is available online at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl; make sure you download a local copy if you want to use it to parse your HTML in XSLT 2 or 3 with good performance.
Here is an example that uses the online copy and parses the input HTML into some XML format I made up:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <categories>
      <xsl:apply-templates select="tail($html-doc//table)"/>
    </categories>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud
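The ul template above is where the plumbing for the improperly nested lists happens. Here is a minimal sketch of that grouping step on its own. It assumes (I have not verified this against the page) that a nested ul appears as a following sibling of the li it belongs to rather than inside it; the inline $sample data and the item element are made up for illustration:

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">

  <!-- Hypothetical sample input: the inner ul is a sibling of the li it belongs to -->
  <xsl:variable name="sample">
    <ul>
      <li>909.7 18th century, 1700-1799</li>
      <li>909.8 1800-</li>
      <ul>
        <li>909.82 1900-</li>
      </ul>
    </ul>
  </xsl:variable>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <items>
      <xsl:apply-templates select="$sample/ul"/>
    </items>
  </xsl:template>

  <xsl:template match="ul">
    <!-- Every li starts a new group; a ul that follows an li falls into that li's group -->
    <xsl:for-each-group select="*" group-starting-with="li">
      <!-- The context item is the first member of the group, i.e. the li -->
      <item text="{normalize-space(.)}">
        <!-- The remaining members (any trailing ul elements) become the children -->
        <xsl:apply-templates select="tail(current-group())"/>
      </item>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

This can be run without a source document by starting at the initial template (for example with Saxon's -it option). The item for 909.82 ends up nested inside the item for 909.8, because the stray ul is grouped with the li that precedes it.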
To create JSON with XSLT 3 you have a couple of options. One is to have your stylesheet create the format the xml-to-json function expects (https://www.w3.org/TR/xslt-30/#json-to-xml-mapping); in the following example I extend the above stylesheet with a mode that takes the previous result XML and creates the XML input you can feed to xml-to-json:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:variable name="categories">
      <categories>
        <xsl:apply-templates select="tail($html-doc//table)"/>
      </categories>
    </xsl:variable>
    <xsl:apply-templates select="$categories" mode="json"/>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

  <xsl:mode name="json" on-no-match="shallow-skip"/>

  <xsl:template match="category | subcategory | sub-sub-category" mode="json">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates mode="#current"/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>

</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/1
The final step then would be to use the function xml-to-json (https://www.w3.org/TR/xpath-functions/#func-xml-to-json) to output JSON instead of XML:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:output method="text"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:variable name="categories">
      <categories>
        <xsl:apply-templates select="tail($html-doc//table)"/>
      </categories>
    </xsl:variable>
    <xsl:variable name="json-xml">
      <xsl:apply-templates select="$categories" mode="json"/>
    </xsl:variable>
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

  <xsl:mode name="json" on-no-match="shallow-skip"/>

  <xsl:template match="category | subcategory | sub-sub-category" mode="json">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates mode="#current"/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>

</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/2
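To see the intermediate vocabulary without the HTML parsing around it, here is a stripped-down sketch that feeds a hand-written fn:map fragment to xml-to-json. The category numbers and titles are taken from the question's 909 example, but the fragment is written by hand for illustration and is not output of the stylesheet above:

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="text"/>

  <xsl:template match="/" name="xsl:initial-template">
    <!-- Hand-written XML representation of JSON, in the same shape the json mode produces:
         an outer map with one entry per category ID, holding a title and optional children -->
    <xsl:variable name="json-xml">
      <fn:map>
        <fn:map key="909">
          <fn:string key="title">World history</fn:string>
          <fn:array key="children">
            <fn:map>
              <fn:map key="909.8">
                <fn:string key="title">1800-</fn:string>
              </fn:map>
            </fn:map>
          </fn:array>
        </fn:map>
      </fn:map>
    </xsl:variable>
    <!-- xml-to-json returns the JSON text as a string, hence output method text -->
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>

</xsl:stylesheet>
```

The result is a JSON object with the single key "909", whose value holds a "title" string and a "children" array containing one nested object keyed "909.8". Note that xml-to-json expects exactly one top-level element, so a variant that produces several top-level fn:map elements would need to wrap them in an fn:array first.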
https://xsltfiddle.liberty-development.net/3NzcBud/3 is the same code applied to a different input file; at least the XML -> XML -> JSON generation doesn't break, though I haven't checked whether the HTML tables and lists have the same structure as in the previous input.
As another option for creating JSON with XSLT 3, given support for the XPath 3.1 map and array data types (which Saxon 9.8/9.9 in all editions provides, as does Altova 2017/2018/2019), you can directly create maps and arrays and serialize with method json:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string"
    >http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>
  <xsl:output method="json" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:apply-templates select="tail($html-doc//table)"/>
  </xsl:template>

  <xsl:template match="table">
    <xsl:sequence select="mf:category-map(tr[1]/td[1], (head(tr)/td[2], tail(tr)/td[1]))"/>
  </xsl:template>

  <xsl:template match="td">
    <xsl:sequence select="mf:category-map(., following-sibling::td[1]/ul)"/>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <xsl:sequence select="mf:category-map(., tail(current-group()))"/>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:split-index-title" as="xs:string*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence
      select="
        let $components := tokenize(normalize-space($input))
        return
          (head($components), string-join(tail($components), ' '))"
    />
  </xsl:function>

  <xsl:function name="mf:category-map" as="map(xs:string, item())">
    <xsl:param name="category" as="element()"/>
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:variable name="components" select="mf:split-index-title($category)"/>
    <xsl:map>
      <xsl:map-entry key="$components[1]">
        <xsl:map>
          <xsl:map-entry key="'title'" select="$components[2]"/>
          <xsl:if test="$subcategories">
            <xsl:map-entry key="'children'">
              <xsl:sequence select="array{ mf:child-categories($subcategories) }"/>
            </xsl:map-entry>
          </xsl:if>
        </xsl:map>
      </xsl:map-entry>
    </xsl:map>
  </xsl:function>

  <xsl:function name="mf:child-categories" as="map(xs:string, item())*">
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:apply-templates select="$subcategories"/>
  </xsl:function>

</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/4
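For comparison, the same target shape can be written as a literal XPath 3.1 map and array and serialized with the json output method. This is only a sketch with hand-picked values from the question's 909 example; it uses XPath map and array constructors instead of the xsl:map and xsl:map-entry instructions used above, which should build the same kind of value:

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">

  <!-- method json serializes maps as objects and arrays as arrays -->
  <xsl:output method="json" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <!-- Illustrative values only: one category keyed by its ID, with a title
         and an array of child categories of the same shape -->
    <xsl:sequence select="
      map {
        '909' : map {
          'title' : 'World history',
          'children' : array {
            map { '909.8' : map { 'title' : '1800-' } }
          }
        }
      }"/>
  </xsl:template>

</xsl:stylesheet>
```

The design difference from the xml-to-json route is that no intermediate XML tree is built: the stylesheet returns the map itself and leaves the JSON text generation to the serializer.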
Thank you very much.
– saloda
Nov 23 at 10:47
As another option for creating JSON with XSLT 3, given support for the XPath 3.1 map and array data types (which Saxon 9.8/9.9 in all editions provides, as does Altova 2017/2018/2019), you can directly create maps and arrays and serialize with method json:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="data:,dpc"
xmlns:mf="http://example.com/mf"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
xmlns:map="http://www.w3.org/2005/xpath-functions/map"
expand-text="yes"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:param name="html-file" as="xs:string"
>http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
<xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
<xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="json" indent="yes"/>
<xsl:template match="/" name="xsl:initial-template">
<xsl:apply-templates select="tail($html-doc//table)"/>
</xsl:template>
<xsl:template match="table">
<xsl:sequence select="mf:category-map(tr[1]/td[1], (head(tr)/td[2], tail(tr)/td[1]))"/>
</xsl:template>
<xsl:template match="td">
<xsl:sequence select="mf:category-map(., following-sibling::td[1]/ul)"/>
</xsl:template>
<xsl:template match="ul">
<xsl:for-each-group select="*" group-starting-with="li">
<xsl:sequence select="mf:category-map(., tail(current-group()))"/>
</xsl:for-each-group>
</xsl:template>
<xsl:function name="mf:split-index-title" as="xs:string*">
<xsl:param name="input" as="xs:string"/>
<xsl:sequence
select="
let $components := tokenize(normalize-space($input))
return
(head($components), string-join(tail($components), ' '))"
/>
</xsl:function>
<xsl:function name="mf:category-map" as="map(xs:string, item())">
<xsl:param name="category" as="element()"/>
<xsl:param name="subcategories" as="element()*"/>
<xsl:variable name="components" select="mf:split-index-title($category)"/>
<xsl:map>
<xsl:map-entry key="$components[1]">
<xsl:map>
<xsl:map-entry key="'title'" select="$components[2]"/>
<xsl:if test="$subcategories">
<xsl:map-entry key="'children'">
<xsl:sequence select="array{ mf:child-categories($subcategories) }"/>
</xsl:map-entry>
</xsl:if>
</xsl:map>
</xsl:map-entry>
</xsl:map>
</xsl:function>
<xsl:function name="mf:child-categories" as="map(xs:string, item())*">
<xsl:param name="subcategories" as="element()*"/>
<xsl:apply-templates select="$subcategories"/>
</xsl:function>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/4
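Stripped of the HTML parsing, the map-based approach comes down to the following minimal sketch. The literal key and title values are just sample data from the question, and the stylesheet again assumes a start with the named template xsl:initial-template. It mixes the xsl:map/xsl:map-entry instructions with the XPath 3.1 map and array constructors to show that both build the same kind of value.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="3.0">
  <!-- the json output method serializes a single map or array as JSON -->
  <xsl:output method="json" indent="yes"/>
  <xsl:template name="xsl:initial-template">
    <xsl:map>
      <xsl:map-entry key="'909'">
        <xsl:map>
          <xsl:map-entry key="'title'" select="'World history'"/>
          <xsl:map-entry key="'children'">
            <!-- children as an array of maps, built with XPath constructors -->
            <xsl:sequence
              select="array { map { '909.7' : map { 'title' : '18th century, 1700-1799' } } }"/>
          </xsl:map-entry>
        </xsl:map>
      </xsl:map-entry>
    </xsl:map>
  </xsl:template>
</xsl:stylesheet>
```

Compared with the xml-to-json route there is no intermediate XML vocabulary to generate; the trade-off is that the templates have to return maps rather than elements, which is why the stylesheet above wraps the recursion in functions.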
thank you very much
– saloda
Nov 23 at 10:47
The data you have seems poorly structured (only checked http://bpeck.com/references/DDC/ddc_mine900.htm but that doesn't pass HTML validation at https://validator.w3.org/check?uri=http%3A%2F%2Fbpeck.com%2Freferences%2FDDC%2Fddc_mine900.htm&charset=%28detect+automatically%29&doctype=Inline&group=0, in particular the lists of subcategories are not properly nested so some XSLT plumbing is needed).
As for parsing HTML with XSLT 2 or 3, if you can't set up Saxon to use an HTML parser like TagSoup instead of an XML parser for the input you can try to use David Carlisle's htmlparse
function implemented in pure XSLT 2, it is available online at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl, make sure you download a local copy if you want to use it to parse your HTML in XSLT 2 or 3 with good performance.
Here is an example that uses the online copy and parses the input HTML into some XML format I made up:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="data:,dpc"
xmlns:mf="http://example.com/mf"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
expand-text="yes"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
<xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
<xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/" name="xsl:initial-template">
<categories>
<xsl:apply-templates select="tail($html-doc//table)"/>
</categories>
</xsl:template>
<xsl:template match="table">
<category>
<xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
<xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
</category>
</xsl:template>
<xsl:template match="td">
<subcategory>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="following-sibling::td[1]/ul"/>
</subcategory>
</xsl:template>
<xsl:template match="ul">
<xsl:for-each-group select="*" group-starting-with="li">
<sub-sub-category>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="tail(current-group())"/>
</sub-sub-category>
</xsl:for-each-group>
</xsl:template>
<xsl:function name="mf:create-attributes" as="attribute()*">
<xsl:param name="input" as="xs:string"/>
<xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
<xsl:attribute name="name" select="head($input-components)"/>
<xsl:attribute name="title" select="tail($input-components)"/>
</xsl:function>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud
To create JSON with XSLT 3, you have a couple of options, one is to have your stylesheet create the format the xml-to-json
function expects (https://www.w3.org/TR/xslt-30/#json-to-xml-mapping); in the following example I extend the above stylesheet with a mode that takes the previous result XML to create the XML input you can feed to xml-to-json
:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="data:,dpc"
xmlns:mf="http://example.com/mf"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
expand-text="yes"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
<xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
<xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="xml" indent="yes"/>
<xsl:template match="/" name="xsl:initial-template">
<xsl:variable name="categories">
<categories>
<xsl:apply-templates select="tail($html-doc//table)"/>
</categories>
</xsl:variable>
<xsl:apply-templates select="$categories" mode="json"/>
</xsl:template>
<xsl:template match="table">
<category>
<xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
<xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
</category>
</xsl:template>
<xsl:template match="td">
<subcategory>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="following-sibling::td[1]/ul"/>
</subcategory>
</xsl:template>
<xsl:template match="ul">
<xsl:for-each-group select="*" group-starting-with="li">
<sub-sub-category>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="tail(current-group())"/>
</sub-sub-category>
</xsl:for-each-group>
</xsl:template>
<xsl:function name="mf:create-attributes" as="attribute()*">
<xsl:param name="input" as="xs:string"/>
<xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
<xsl:attribute name="name" select="head($input-components)"/>
<xsl:attribute name="title" select="tail($input-components)"/>
</xsl:function>
<xsl:mode name="json" on-no-match="shallow-skip"/>
<xsl:template match="category | subcategory | sub-sub-category" mode="json">
<fn:map>
<fn:map key="{@name}">
<fn:string key="title">{@title}</fn:string>
<xsl:where-populated>
<fn:array key="children">
<xsl:apply-templates mode="#current"/>
</fn:array>
</xsl:where-populated>
</fn:map>
</fn:map>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/1
The final step then would to use the function xml-to-json
(https://www.w3.org/TR/xpath-functions/#func-xml-to-json) to output JSON instead of XML:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="data:,dpc"
xmlns:mf="http://example.com/mf"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
expand-text="yes"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
<xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
<xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="text"/>
<xsl:template match="/" name="xsl:initial-template">
<xsl:variable name="categories">
<categories>
<xsl:apply-templates select="tail($html-doc//table)"/>
</categories>
</xsl:variable>
<xsl:variable name="json-xml">
<xsl:apply-templates select="$categories" mode="json"/>
</xsl:variable>
<xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
</xsl:template>
<xsl:template match="table">
<category>
<xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
<xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
</category>
</xsl:template>
<xsl:template match="td">
<subcategory>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="following-sibling::td[1]/ul"/>
</subcategory>
</xsl:template>
<xsl:template match="ul">
<xsl:for-each-group select="*" group-starting-with="li">
<sub-sub-category>
<xsl:sequence select="mf:create-attributes(.)"/>
<xsl:apply-templates select="tail(current-group())"/>
</sub-sub-category>
</xsl:for-each-group>
</xsl:template>
<xsl:function name="mf:create-attributes" as="attribute()*">
<xsl:param name="input" as="xs:string"/>
<xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
<xsl:attribute name="name" select="head($input-components)"/>
<xsl:attribute name="title" select="tail($input-components)"/>
</xsl:function>
<xsl:mode name="json" on-no-match="shallow-skip"/>
<xsl:template match="category | subcategory | sub-sub-category" mode="json">
<fn:map>
<fn:map key="{@name}">
<fn:string key="title">{@title}</fn:string>
<xsl:where-populated>
<fn:array key="children">
<xsl:apply-templates mode="#current"/>
</fn:array>
</xsl:where-populated>
</fn:map>
</fn:map>
</xsl:template>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/2
https://xsltfiddle.liberty-development.net/3NzcBud/3 is the same code applied to a different input file. At least the XML -> XML -> JSON generation doesn't break, but I haven't checked whether the HTML table and lists have the same structure as in the previous input.
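To make the intermediate format that xml-to-json expects more concrete, here is a minimal sketch that skips the HTML parsing entirely and hard-codes one category with one child. The sample values (909, 909.7) are taken from the question; the stand-alone stylesheet itself is only an illustration, not part of the fiddles above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    version="3.0">
  <xsl:output method="text"/>
  <!-- Entry point: no source document is needed -->
  <xsl:template name="xsl:initial-template">
    <!-- XML representation of JSON: fn:map = object, fn:array = array,
         fn:string = string; @key carries the property name -->
    <xsl:variable name="json-xml">
      <fn:map>
        <fn:map key="909">
          <fn:string key="title">World history</fn:string>
          <fn:array key="children">
            <fn:map>
              <fn:map key="909.7">
                <fn:string key="title">18th century, 1700-1799</fn:string>
              </fn:map>
            </fn:map>
          </fn:array>
        </fn:map>
      </fn:map>
    </xsl:variable>
    <!-- Convert the XML representation to a JSON string -->
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>
</xsl:stylesheet>
```

The result should be a JSON object with the single property "909", whose value holds "title" and a "children" array. This is the same shape that the json mode templates above generate for every category level.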
Another option for creating JSON with XSLT 3 relies on the XPath 3.1 map and array data types, which Saxon 9.8/9.9 supports in all editions, as does Altova 2017/2018/2019. You can create the maps and arrays directly and serialize them with the output method json:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:d="data:,dpc"
xmlns:mf="http://example.com/mf"
xmlns:fn="http://www.w3.org/2005/xpath-functions"
xmlns:map="http://www.w3.org/2005/xpath-functions/map"
expand-text="yes"
exclude-result-prefixes="#all"
version="3.0">
<xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>
<xsl:param name="html-file" as="xs:string"
>http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>
<xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>
<xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>
<xsl:mode on-no-match="shallow-copy"/>
<xsl:output method="json" indent="yes"/>
<xsl:template match="/" name="xsl:initial-template">
<xsl:apply-templates select="tail($html-doc//table)"/>
</xsl:template>
<xsl:template match="table">
<xsl:sequence select="mf:category-map(tr[1]/td[1], (head(tr)/td[2], tail(tr)/td[1]))"/>
</xsl:template>
<xsl:template match="td">
<xsl:sequence select="mf:category-map(., following-sibling::td[1]/ul)"/>
</xsl:template>
<xsl:template match="ul">
<xsl:for-each-group select="*" group-starting-with="li">
<xsl:sequence select="mf:category-map(., tail(current-group()))"/>
</xsl:for-each-group>
</xsl:template>
<xsl:function name="mf:split-index-title" as="xs:string*">
<xsl:param name="input" as="xs:string"/>
<xsl:sequence
select="
let $components := tokenize(normalize-space($input))
return
(head($components), string-join(tail($components), ' '))"
/>
</xsl:function>
<xsl:function name="mf:category-map" as="map(xs:string, item())">
<xsl:param name="category" as="element()"/>
<xsl:param name="subcategories" as="element()*"/>
<xsl:variable name="components" select="mf:split-index-title($category)"/>
<xsl:map>
<xsl:map-entry key="$components[1]">
<xsl:map>
<xsl:map-entry key="'title'" select="$components[2]"/>
<xsl:if test="$subcategories">
<xsl:map-entry key="'children'">
<xsl:sequence select="array{ mf:child-categories($subcategories) }"/>
</xsl:map-entry>
</xsl:if>
</xsl:map>
</xsl:map-entry>
</xsl:map>
</xsl:function>
<xsl:function name="mf:child-categories" as="map(xs:string, item())*">
<xsl:param name="subcategories" as="element()*"/>
<xsl:apply-templates select="$subcategories"/>
</xsl:function>
</xsl:stylesheet>
https://xsltfiddle.liberty-development.net/3NzcBud/4
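As a reduced illustration of that last approach, the sketch below builds the nested structure with XPath 3.1 map and array constructors instead of xsl:map/xsl:map-entry and lets the json output method do the serialization. The data is hard-coded from the question's 909 examples and is an assumption made for the demo; a real run would take it from the parsed HTML as shown above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">
  <!-- The json method serializes XDM maps and arrays directly -->
  <xsl:output method="json" indent="yes"/>
  <!-- Entry point: no source document is needed -->
  <xsl:template name="xsl:initial-template">
    <xsl:sequence select="
      map {
        '909' : map {
          'title' : 'World history',
          (: array { ... } turns each item of the sequence into one member :)
          'children' : array {
            map { '909.7' : map { 'title' : '18th century, 1700-1799' } },
            map { '909.8' : map { 'title' : '1800-' } }
          }
        }
      }"/>
  </xsl:template>
</xsl:stylesheet>
```

No intermediate XML and no call to xml-to-json are needed here. The trade-off is that the result is a map rather than a node tree, so it cannot be post-processed with template matching on elements.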
edited Nov 23 at 8:38
answered Nov 22 at 21:00
Martin Honnen
thank you very much
– saloda
Nov 23 at 10:47
The result you have shown doesn't show any sample data, nor is it XML or JSON. So where do you get the ID values from, and what do you consider a main category, a sub category, and so on?
– Martin Honnen
Nov 22 at 15:59
And which XSLT 3 processor do you use, given that the input is not XML but HTML? Do you use Saxon PE or EE, where you have saxonica.com/html/documentation/functions/saxon/parse-html.html, or some other way to plug an HTML parser into the tool chain?
– Martin Honnen
Nov 22 at 16:25
I use Saxon HE 9.8.
– saloda
Nov 22 at 16:29
It depends: do you use the Java version of Saxon HE? How do you run it, from the command line or from Java code? Can you install TagSoup vrici.lojban.org/~cowan/XML/tagsoup? Saxon (even HE) has a command line option -x to name a parser, so if TagSoup is on the class path then -x:org.ccil.cowan.tagsoup.Parser is supposed to use that parser for the input instead of a normal XML parser; you then have the (X)HTML tree TagSoup gives as the input.
– Martin Honnen
Nov 22 at 16:39