Parse HTML file, obtaining data from nested categories hierarchy, using XSLT 3























Given the following HTML files:

http://bpeck.com/references/DDC/ddc_mine900.htm

http://bpeck.com/references/DDC/ddc_mine200.htm

http://bpeck.com/references/DDC/ddc_mine500.htm

etc.,

how could I get output that shows the hierarchy of the categories?



/---------------------
| ID | Name
| 1 | Main Category
| 3 | Sub Category
| 5 | Sub-Sub Category
| 4 | Sub Category
| 2 | Next Main Category
----------------------


Ideally the output would be in JSON format, but I guess XML would do.

I struggled with a serial parser (SAX) but failed, so I am looking for an elegant solution.



Main categories:

900 World History
910 Geography and travel [see area subdivisions]
920 Biography, genealogy, insignia
930 History of the ancient world
940 General history of Europe [check schedules for date subdivisions]
950 General history of Asia, Far East

etc.

Subcategories of 900:



900 Geography & history
901 Philosophy & theory
902 Miscellany
903 Dictionaries & encyclopedias
904 Collected accounts of events
905 Serial publications
906 Organizations & management
907 Education, research, related topics
908 With respect to kinds of persons


...



Example of sub-sub-categories found under 909 World history:



909.7 18th century, 1700-1799
909.8 1800-
909.82 1900-


As for the output, I would prefer whichever approach you judge to be best.
Each key would be the ID (900, 901, 902, etc.) and the corresponding value would be the name (Geography & history, Philosophy & theory, Miscellany). The output JSON should be nested, showing the hierarchy of the categories.
I use Saxon HE version 9.8.


































  • The result you have shown doesn't show any sample data, nor is it XML or JSON. So where do you get the ID values from, and what do you consider a main category, a sub category, and so on?
    – Martin Honnen
    Nov 22 at 15:59










  • And which XSLT 3 processor do you use, given that the input is not XML but HTML? Do you use Saxon PE or EE, where you have saxonica.com/html/documentation/functions/saxon/parse-html.html, or some other way to plug an HTML parser into the tool chain?
    – Martin Honnen
    Nov 22 at 16:25










  • I use Saxon HE 9.8.
    – saloda
    Nov 22 at 16:29












  • It depends. Do you use the Java version of Saxon HE? How do you run it: from the command line or from Java code? Can you install TagSoup (vrici.lojban.org/~cowan/XML/tagsoup)? Saxon (even HE) has a command-line option -x to name a parser, so if TagSoup is on the class path, -x:org.ccil.cowan.tagsoup.Parser is supposed to use that parser for the input instead of a normal XML parser; you then have the (X)HTML tree TagSoup produces as the input.
    – Martin Honnen
    Nov 22 at 16:39















xslt-3.0














edited Nov 22 at 16:28

























asked Nov 22 at 15:02









saloda

























1 Answer



accepted










The data you have seems poorly structured. I only checked http://bpeck.com/references/DDC/ddc_mine900.htm, but that doesn't pass HTML validation at https://validator.w3.org/check?uri=http%3A%2F%2Fbpeck.com%2Freferences%2FDDC%2Fddc_mine900.htm&charset=%28detect+automatically%29&doctype=Inline&group=0. In particular, the lists of subcategories are not properly nested, so some XSLT plumbing is needed.



As for parsing HTML with XSLT 2 or 3: if you can't set up Saxon to use an HTML parser like TagSoup instead of an XML parser for the input, you can try David Carlisle's htmlparse function, which is implemented in pure XSLT 2. It is available online at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl. Make sure you download a local copy if you want to use it to parse your HTML in XSLT 2 or 3 with good performance.
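A minimal sketch of the local-copy setup, assuming htmlparse.xsl has been saved next to the stylesheet and the HTML page has been saved as ddc_mine900.htm in the same directory (both file locations are assumptions for illustration). It uses the same three-argument d:htmlparse call as the stylesheets in this answer and simply outputs the parsed tree so you can inspect what the parser gives you:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    exclude-result-prefixes="#all"
    version="3.0">

  <!-- Assumption: a local copy of htmlparse.xsl sits next to this stylesheet -->
  <xsl:import href="htmlparse.xsl"/>

  <!-- Hypothetical parameter: a locally saved copy of the HTML page -->
  <xsl:param name="html-file" as="xs:string" select="'ddc_mine900.htm'"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template name="xsl:initial-template">
    <!-- Parse the HTML text into nodes in no namespace and output them unchanged -->
    <xsl:sequence select="d:htmlparse(unparsed-text($html-file), '', true())"/>
  </xsl:template>

</xsl:stylesheet>
```

The stylesheet needs no source document; it is meant to be started at the initial template.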



Here is an example that uses the online copy and parses the input HTML into some XML format I made up:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <!-- David Carlisle's pure XSLT HTML parser, provides d:htmlparse -->
  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" indent="yes"/>

  <!-- Process every table of the parsed HTML except the first one -->
  <xsl:template match="/" name="xsl:initial-template">
    <categories>
      <xsl:apply-templates select="tail($html-doc//table)"/>
    </categories>
  </xsl:template>

  <!-- The first cell of the first row names the category; the second cell of
       the first row and the first cell of each later row are subcategories -->
  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <!-- A subcategory cell; a list in the next cell holds its sub-subcategories -->
  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <!-- The lists are not properly nested, so start a group at each li and
       treat the elements following it in the group as its children -->
  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Split e.g. "902 Miscellany" into name="902" and title="Miscellany" -->
  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud
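The grouping step may be easier to follow in isolation. Here is a small sketch of group-starting-with applied to a made-up list in which a nested ul follows its li as a sibling instead of sitting inside it; the inline data and the item/label names are illustrative assumptions, not taken from the real page:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Made-up input: the inner ul is a sibling of the li it belongs to -->
  <xsl:variable name="flat-list">
    <ul>
      <li>909.7 18th century, 1700-1799</li>
      <li>909.8 1800-</li>
      <ul>
        <li>909.82 1900-</li>
      </ul>
    </ul>
  </xsl:variable>

  <xsl:template name="xsl:initial-template">
    <items>
      <xsl:apply-templates select="$flat-list/ul"/>
    </items>
  </xsl:template>

  <xsl:template match="ul">
    <!-- Every li starts a new group; the rest of the group is what follows it -->
    <xsl:for-each-group select="*" group-starting-with="li">
      <!-- The context item is the first item of the group, i.e. the li -->
      <item label="{normalize-space(.)}">
        <!-- A ul that trails the li is processed as its nested content -->
        <xsl:apply-templates select="tail(current-group())"/>
      </item>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

With this input, the item for 909.8 ends up containing the item for 909.82, while the item for 909.7 stays empty.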



To create JSON with XSLT 3, you have a couple of options. One is to have your stylesheet create the format the xml-to-json function expects (https://www.w3.org/TR/xslt-30/#json-to-xml-mapping). In the following example I extend the above stylesheet with a mode that takes the previous result XML and creates the XML input you can feed to xml-to-json:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:variable name="categories">
      <categories>
        <xsl:apply-templates select="tail($html-doc//table)"/>
      </categories>
    </xsl:variable>
    <xsl:apply-templates select="$categories" mode="json"/>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

  <xsl:mode name="json" on-no-match="shallow-skip"/>

  <xsl:template match="category | subcategory | sub-sub-category" mode="json">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates mode="#current"/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud/1



The final step would then be to use the function xml-to-json (https://www.w3.org/TR/xpath-functions/#func-xml-to-json) to output JSON instead of XML:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="text"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:variable name="categories">
      <categories>
        <xsl:apply-templates select="tail($html-doc//table)"/>
      </categories>
    </xsl:variable>
    <xsl:variable name="json-xml">
      <xsl:apply-templates select="$categories" mode="json"/>
    </xsl:variable>
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

  <xsl:mode name="json" on-no-match="shallow-skip"/>

  <xsl:template match="category | subcategory | sub-sub-category" mode="json">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates mode="#current"/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud/2
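To see the vocabulary that xml-to-json expects without the HTML parsing around it, here is a minimal sketch with hand-written input. The two categories are copied from the question's sample list; the nesting of 901 under 900 is only an assumption for illustration:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="text"/>

  <!-- Hand-written XML in the namespace and shape that xml-to-json accepts -->
  <xsl:variable name="json-xml">
    <fn:map>
      <fn:map key="900">
        <fn:string key="title">Geography &amp; history</fn:string>
        <fn:array key="children">
          <fn:map>
            <fn:map key="901">
              <fn:string key="title">Philosophy &amp; theory</fn:string>
            </fn:map>
          </fn:map>
        </fn:array>
      </fn:map>
    </fn:map>
  </xsl:variable>

  <xsl:template name="xsl:initial-template">
    <!-- The result is a string, hence the text output method -->
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>

</xsl:stylesheet>
```

Each fn:map becomes a JSON object, each fn:array a JSON array, and the key attributes become the property names, which is why the json mode above wraps every category in an fn:map keyed by its number.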



https://xsltfiddle.liberty-development.net/3NzcBud/3 is the same code applied to a different input file. At least the XML -> XML -> JSON generation doesn't break, but I haven't checked whether the HTML table and lists have the same structure as in the previous input.



Another option to create JSON with XSLT 3 relies on support for the XPath 3.1 map and array data types, which Saxon 9.8/9.9 in all editions provide, as does Altova 2017/2018/2019. With that you can create maps and arrays directly and serialize them with the output method json:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="json" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:apply-templates select="tail($html-doc//table)"/>
  </xsl:template>

  <xsl:template match="table">
    <xsl:sequence select="mf:category-map(tr[1]/td[1], (head(tr)/td[2], tail(tr)/td[1]))"/>
  </xsl:template>

  <xsl:template match="td">
    <xsl:sequence select="mf:category-map(., following-sibling::td[1]/ul)"/>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <xsl:sequence select="mf:category-map(., tail(current-group()))"/>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:split-index-title" as="xs:string*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence
        select="
          let $components := tokenize(normalize-space($input))
          return
            (head($components), string-join(tail($components), ' '))"/>
  </xsl:function>

  <xsl:function name="mf:category-map" as="map(xs:string, item())">
    <xsl:param name="category" as="element()"/>
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:variable name="components" select="mf:split-index-title($category)"/>
    <xsl:map>
      <xsl:map-entry key="$components[1]">
        <xsl:map>
          <xsl:map-entry key="'title'" select="$components[2]"/>
          <xsl:if test="$subcategories">
            <xsl:map-entry key="'children'">
              <xsl:sequence select="array{ mf:child-categories($subcategories) }"/>
            </xsl:map-entry>
          </xsl:if>
        </xsl:map>
      </xsl:map-entry>
    </xsl:map>
  </xsl:function>

  <xsl:function name="mf:child-categories" as="map(xs:string, item())*">
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:apply-templates select="$subcategories"/>
  </xsl:function>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud/4
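Stripped of the HTML handling, the map-based approach comes down to the following sketch. The data is hard-coded from the question's sample list, and nesting 901 under 900 is an assumption made only to show a children array:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

  <!-- The json output method serializes XPath 3.1 maps and arrays directly -->
  <xsl:output method="json" indent="yes"/>

  <xsl:template name="xsl:initial-template">
    <xsl:map>
      <!-- Key: the category number; value: a map with its title and children -->
      <xsl:map-entry key="'900'">
        <xsl:map>
          <xsl:map-entry key="'title'" select="'Geography &amp; history'"/>
          <!-- A square array constructor with a single member, itself a map -->
          <xsl:map-entry key="'children'"
              select="[ map { '901' : map { 'title' : 'Philosophy &amp; theory' } } ]"/>
        </xsl:map>
      </xsl:map-entry>
    </xsl:map>
  </xsl:template>

</xsl:stylesheet>
```

Compared with the xml-to-json route there is no intermediate XML vocabulary; the trade-off is that the processor has to support maps, arrays and the json serialization method.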





























  • thank you very much
    – saloda
    Nov 23 at 10:47











The final step then would be to use the function xml-to-json (https://www.w3.org/TR/xpath-functions/#func-xml-to-json) to output JSON instead of XML:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <!-- xml-to-json returns a string, so the result is written with the text method -->
  <xsl:output method="text"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:variable name="categories">
      <categories>
        <xsl:apply-templates select="tail($html-doc//table)"/>
      </categories>
    </xsl:variable>
    <xsl:variable name="json-xml">
      <xsl:apply-templates select="$categories" mode="json"/>
    </xsl:variable>
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

  <xsl:mode name="json" on-no-match="shallow-skip"/>

  <xsl:template match="category | subcategory | sub-sub-category" mode="json">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates mode="#current"/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud/2
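
To see the XML representation of JSON in isolation, here is a reduced sketch that skips the HTML parsing and feeds a hand-written sample to xml-to-json. The sample values are taken from the question; the nesting is only an illustration, not output of the stylesheets above. Run it with the initial template (e.g. Saxon's -it option).

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="text"/>

  <!-- Hand-written sample in the format that xml-to-json expects -->
  <xsl:variable name="json-xml">
    <fn:map>
      <fn:map key="909">
        <fn:string key="title">World history</fn:string>
        <fn:array key="children">
          <fn:map>
            <fn:map key="909.7">
              <fn:string key="title">18th century, 1700-1799</fn:string>
            </fn:map>
          </fn:map>
        </fn:array>
      </fn:map>
    </fn:map>
  </xsl:variable>

  <xsl:template name="xsl:initial-template">
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>

</xsl:stylesheet>
```

Ignoring whitespace, the result should be equivalent to {"909":{"title":"World history","children":[{"909.7":{"title":"18th century, 1700-1799"}}]}}. Elements inside an fn:array carry no key attribute, while elements inside an fn:map must have one.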



https://xsltfiddle.liberty-development.net/3NzcBud/3 is the same code applied to a different input file. At least the XML -> XML -> JSON generation doesn't break, but I haven't checked whether the HTML tables and lists have the same structure as in the previous input.



Another option to create JSON, if your XSLT 3 processor supports the XPath 3.1 map and array data types (Saxon 9.8/9.9 in all editions does, as do Altova 2017/2018/2019), is to create maps and arrays directly and serialize them with the json output method:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="json" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:apply-templates select="tail($html-doc//table)"/>
  </xsl:template>

  <!-- Each template returns a map instead of constructing an XML element -->
  <xsl:template match="table">
    <xsl:sequence select="mf:category-map(tr[1]/td[1], (head(tr)/td[2], tail(tr)/td[1]))"/>
  </xsl:template>

  <xsl:template match="td">
    <xsl:sequence select="mf:category-map(., following-sibling::td[1]/ul)"/>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <xsl:sequence select="mf:category-map(., tail(current-group()))"/>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Returns two strings: the index (first token) and the title (remaining tokens) -->
  <xsl:function name="mf:split-index-title" as="xs:string*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence
      select="
        let $components := tokenize(normalize-space($input))
        return
          (head($components), string-join(tail($components), ' '))"
    />
  </xsl:function>

  <!-- Builds { index : { 'title' : ..., 'children' : [...] } }; 'children' only if present -->
  <xsl:function name="mf:category-map" as="map(xs:string, item())">
    <xsl:param name="category" as="element()"/>
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:variable name="components" select="mf:split-index-title($category)"/>
    <xsl:map>
      <xsl:map-entry key="$components[1]">
        <xsl:map>
          <xsl:map-entry key="'title'" select="$components[2]"/>
          <xsl:if test="$subcategories">
            <xsl:map-entry key="'children'">
              <xsl:sequence select="array{ mf:child-categories($subcategories) }"/>
            </xsl:map-entry>
          </xsl:if>
        </xsl:map>
      </xsl:map-entry>
    </xsl:map>
  </xsl:function>

  <!-- Wraps apply-templates in a function so it can be called from an XPath expression -->
  <xsl:function name="mf:child-categories" as="map(xs:string, item())*">
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:apply-templates select="$subcategories"/>
  </xsl:function>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud/4
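
As a reduced sketch of this second approach, the following stylesheet builds the nested maps and arrays as XPath 3.1 literals and lets the json output method serialize them. The category values come from the question; the nesting shown is hand-written for illustration only.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="json" indent="yes"/>

  <xsl:template name="xsl:initial-template">
    <!-- One outer map keyed by the index; children are an array of such maps -->
    <xsl:sequence select="
      map {
        '909' : map {
          'title' : 'World history',
          'children' : array {
            map { '909.7' : map { 'title' : '18th century, 1700-1799' } },
            map { '909.8' : map { 'title' : '1800-' } }
          }
        }
      }"/>
  </xsl:template>

</xsl:stylesheet>
```

Note that maps are unordered, so the order of "title" and "children" in the serialized JSON may vary; arrays keep their order, which is one reason to hold the child categories in an array.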






share|improve this answer























  • thank you very much
    – saloda
    Nov 23 at 10:47

























The data you have seems poorly structured. I only checked http://bpeck.com/references/DDC/ddc_mine900.htm, but it doesn't pass HTML validation at https://validator.w3.org/check?uri=http%3A%2F%2Fbpeck.com%2Freferences%2FDDC%2Fddc_mine900.htm&charset=%28detect+automatically%29&doctype=Inline&group=0; in particular, the lists of subcategories are not properly nested, so some XSLT plumbing is needed.



As for parsing HTML with XSLT 2 or 3: if you can't set up Saxon to use an HTML parser like TagSoup instead of an XML parser for the input, you can try David Carlisle's htmlparse function, which is implemented in pure XSLT 2. It is available online at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl. Make sure you download a local copy if you want good performance when using it to parse your HTML in XSLT 2 or 3.
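
A minimal sketch of the local-copy setup, assuming htmlparse.xsl and a saved copy of the page (here called ddc_mine900.htm, a file name I made up) sit in the same directory as the stylesheet. It only dumps the parsed tree so you can inspect what structure the parser produces:

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    exclude-result-prefixes="#all"
    version="3.0">

  <!-- Assumption: htmlparse.xsl was downloaded next to this stylesheet -->
  <xsl:import href="htmlparse.xsl"/>

  <!-- Assumption: local file name of the saved HTML page -->
  <xsl:param name="html-file" as="xs:string" select="'ddc_mine900.htm'"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template name="xsl:initial-template">
    <parsed>
      <!-- Same call as used in the full stylesheets: text, namespace, HTML mode -->
      <xsl:sequence select="d:htmlparse(unparsed-text($html-file), '', true())"/>
    </parsed>
  </xsl:template>

</xsl:stylesheet>
```

Relative URIs in xsl:import and unparsed-text are resolved against the location of the stylesheet, so nothing is fetched over the network on each run.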



Here is an example that uses the online copy and parses the input HTML into some XML format I made up:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <!-- Read the HTML as plain text, then parse it with the pure XSLT parser -->
  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <categories>
      <!-- Skip the first table; every remaining table becomes one category -->
      <xsl:apply-templates select="tail($html-doc//table)"/>
    </categories>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <!-- The list in the adjacent cell holds the sub-sub-categories -->
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <!-- Each li starts a group; a following nested ul ends up in the same group -->
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <!-- First token becomes @name, the remaining tokens (space-joined) become @title -->
  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud
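
The grouping step is the "plumbing" that repairs the improperly nested lists. Here is a reduced sketch on a hypothetical, simplified input in which a nested ul follows its li as a sibling instead of being its child; the element name entry and the sample nesting are made up for illustration.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical input: the inner ul is a sibling of the li it belongs to -->
  <xsl:variable name="sample">
    <ul>
      <li>909.7 18th century, 1700-1799</li>
      <li>909.8 1800-</li>
      <ul>
        <li>909.82 1900-</li>
      </ul>
    </ul>
  </xsl:variable>

  <xsl:template name="xsl:initial-template">
    <categories>
      <xsl:apply-templates select="$sample/ul"/>
    </categories>
  </xsl:template>

  <xsl:template match="ul">
    <!-- Every li opens a new group; non-li siblings after it join that group -->
    <xsl:for-each-group select="*" group-starting-with="li">
      <!-- The context item is the li; split its text into index and title words -->
      <xsl:variable name="parts" select="tokenize(normalize-space(.))"/>
      <entry name="{head($parts)}" title="{tail($parts)}">
        <!-- Whatever follows the li in its group (a nested ul) is processed recursively -->
        <xsl:apply-templates select="tail(current-group())"/>
      </entry>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

For this input the entry for 909.82 ends up nested inside the entry for 909.8, while 909.7 stays a sibling, which is the hierarchy the flat markup was meant to express.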















The data you have seems poorly structured (only checked http://bpeck.com/references/DDC/ddc_mine900.htm but that doesn't pass HTML validation at https://validator.w3.org/check?uri=http%3A%2F%2Fbpeck.com%2Freferences%2FDDC%2Fddc_mine900.htm&charset=%28detect+automatically%29&doctype=Inline&group=0, in particular the lists of subcategories are not properly nested so some XSLT plumbing is needed).



As for parsing HTML with XSLT 2 or 3, if you can't set up Saxon to use an HTML parser like TagSoup instead of an XML parser for the input you can try to use David Carlisle's htmlparse function implemented in pure XSLT 2, it is available online at https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl, make sure you download a local copy if you want to use it to parse your HTML in XSLT 2 or 3 with good performance.



Here is an example that uses the online copy and parses the input HTML into some XML format I made up:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <!-- David Carlisle's HTML parser, imported from the online copy -->
  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <!-- parse the HTML text into an XML node tree -->
  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <categories>
      <!-- process every table except the first one -->
      <xsl:apply-templates select="tail($html-doc//table)"/>
    </categories>
  </xsl:template>

  <!-- the first cell of the first row names the category; the second cell of the
       first row and the first cell of each later row are its subcategories -->
  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <!-- the list in the next cell holds the sub-sub-categories -->
  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <!-- each li starts a new group; the rest of the group is processed as nested content -->
  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <!-- split "ID title" text into a name attribute (first token) and a title attribute (remaining tokens) -->
  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud
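To see the two core techniques of that stylesheet in isolation (grouping with group-starting-with="li" and splitting the text into ID and title with head/tail), here is a minimal sketch. The inline $sample list is a made-up stand-in, assuming that a nested ul follows its li as a sibling; I have not verified that the real page has exactly that shape, and the group element and its attributes are only illustrative names.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- hypothetical sample: the nested ul is a sibling of the li it belongs to -->
  <xsl:variable name="sample">
    <ul>
      <li>909.7 18th century, 1700-1799</li>
      <li>909.8 1800-</li>
      <ul>
        <li>909.82 1900-</li>
      </ul>
    </ul>
  </xsl:variable>

  <xsl:template name="xsl:initial-template">
    <groups>
      <!-- every li child starts a new group; a following ul joins the preceding li -->
      <xsl:for-each-group select="$sample/ul/*" group-starting-with="li">
        <!-- inside the group, the context item is the li that started it -->
        <xsl:variable name="tokens" select="tokenize(normalize-space(.))"/>
        <group name="{head($tokens)}"
               title="{tail($tokens)}"
               members="{count(current-group())}"/>
      </xsl:for-each-group>
    </groups>
  </xsl:template>

</xsl:stylesheet>
```

Run it from the initial template (with Saxon, the -it option). The second group contains both the li for 909.8 and the ul after it, which is why the stylesheet above applies templates to tail(current-group()) to reach the nested entries.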



To create JSON with XSLT 3, you have a couple of options. One is to have your stylesheet create the format the xml-to-json function expects (https://www.w3.org/TR/xslt-30/#json-to-xml-mapping). In the following example I extend the above stylesheet with a mode that transforms the previous result XML into the XML input you can feed to xml-to-json:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <!-- first pass: build the intermediate categories XML -->
    <xsl:variable name="categories">
      <categories>
        <xsl:apply-templates select="tail($html-doc//table)"/>
      </categories>
    </xsl:variable>
    <!-- second pass: convert it to the XML representation of JSON -->
    <xsl:apply-templates select="$categories" mode="json"/>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

  <xsl:mode name="json" on-no-match="shallow-skip"/>

  <!-- every category level becomes a map keyed by its ID, holding the title
       and, only if there are nested categories, a children array -->
  <xsl:template match="category | subcategory | sub-sub-category" mode="json">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates mode="#current"/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud/1
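The xsl:where-populated wrapper in the json mode is what keeps a leaf category from getting an empty children array. The following is a minimal sketch of just that step, assuming a made-up two-element $sample in the intermediate format (the sample data and the unnamed default mode are my simplifications, not part of the stylesheet above):

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- hypothetical sample in the intermediate category format -->
  <xsl:variable name="sample">
    <category name="900" title="Geography &amp; history">
      <subcategory name="901" title="Philosophy &amp; theory"/>
    </category>
  </xsl:variable>

  <xsl:template name="xsl:initial-template">
    <xsl:apply-templates select="$sample/*"/>
  </xsl:template>

  <xsl:template match="category | subcategory">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <!-- fn:array is only written if the nested apply-templates
             gives it at least one child -->
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>

</xsl:stylesheet>
```

In the result, the map for 900 carries a children array while the map for 901 has only its title, because an element with attributes but no child nodes counts as empty for xsl:where-populated.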



The final step then would be to use the function xml-to-json (https://www.w3.org/TR/xpath-functions/#func-xml-to-json) to output JSON instead of XML:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <!-- xml-to-json returns a string, so the text output method is used -->
  <xsl:output method="text"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:variable name="categories">
      <categories>
        <xsl:apply-templates select="tail($html-doc//table)"/>
      </categories>
    </xsl:variable>
    <xsl:variable name="json-xml">
      <xsl:apply-templates select="$categories" mode="json"/>
    </xsl:variable>
    <!-- serialize the XML representation of JSON as indented JSON text -->
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>

  <xsl:template match="table">
    <category>
      <xsl:sequence select="mf:create-attributes(tr[1]/td[1])"/>
      <xsl:apply-templates select="head(tr)/td[2], tail(tr)/td[1]"/>
    </category>
  </xsl:template>

  <xsl:template match="td">
    <subcategory>
      <xsl:sequence select="mf:create-attributes(.)"/>
      <xsl:apply-templates select="following-sibling::td[1]/ul"/>
    </subcategory>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <sub-sub-category>
        <xsl:sequence select="mf:create-attributes(.)"/>
        <xsl:apply-templates select="tail(current-group())"/>
      </sub-sub-category>
    </xsl:for-each-group>
  </xsl:template>

  <xsl:function name="mf:create-attributes" as="attribute()*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="input-components" as="xs:string*" select="tokenize(normalize-space($input))"/>
    <xsl:attribute name="name" select="head($input-components)"/>
    <xsl:attribute name="title" select="tail($input-components)"/>
  </xsl:function>

  <xsl:mode name="json" on-no-match="shallow-skip"/>

  <xsl:template match="category | subcategory | sub-sub-category" mode="json">
    <fn:map>
      <fn:map key="{@name}">
        <fn:string key="title">{@title}</fn:string>
        <xsl:where-populated>
          <fn:array key="children">
            <xsl:apply-templates mode="#current"/>
          </fn:array>
        </xsl:where-populated>
      </fn:map>
    </fn:map>
  </xsl:template>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud/2
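If you want to check the xml-to-json step on its own, without fetching and parsing any HTML, a stripped-down sketch could look like this. The hand-written $json-xml content is only an assumed example of the nesting produced above, using two category names from the question:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="text"/>

  <xsl:template name="xsl:initial-template">
    <!-- hand-written XML representation of JSON: ID -> title and children -->
    <xsl:variable name="json-xml">
      <fn:map>
        <fn:map key="900">
          <fn:string key="title">Geography &amp; history</fn:string>
          <fn:array key="children">
            <fn:map>
              <fn:map key="901">
                <fn:string key="title">Philosophy &amp; theory</fn:string>
              </fn:map>
            </fn:map>
          </fn:array>
        </fn:map>
      </fn:map>
    </xsl:variable>
    <!-- xml-to-json needs a single element (or a document node wrapping one) -->
    <xsl:sequence select="xml-to-json($json-xml, map { 'indent' : true() })"/>
  </xsl:template>

</xsl:stylesheet>
```

Each fn:map, fn:array and fn:string element corresponds to a JSON object, array and string, and the key attribute supplies the property name inside an object.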



https://xsltfiddle.liberty-development.net/3NzcBud/3 is the same code applied to a different input file. At least the XML -> XML -> JSON generation doesn't break, but I haven't checked whether the HTML table and lists have the same structure as in the previous input.



Another option for creating JSON with XSLT 3 relies on support for the XPath 3.1 map and array data types, which Saxon 9.8/9.9 in all editions as well as Altova 2017/2018/2019 provide: you can create maps and arrays directly and serialize them with the output method json:



<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:d="data:,dpc"
    xmlns:mf="http://example.com/mf"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    expand-text="yes"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:import href="https://github.com/davidcarlisle/web-xslt/raw/master/htmlparse/htmlparse.xsl"/>

  <xsl:param name="html-file" as="xs:string">http://bpeck.com/references/DDC/ddc_mine900.htm</xsl:param>

  <xsl:param name="html-text" as="xs:string" select="unparsed-text($html-file)"/>

  <xsl:variable name="html-doc" select="d:htmlparse($html-text, '', true())"/>

  <xsl:mode on-no-match="shallow-copy"/>

  <!-- maps and arrays are serialized directly as JSON -->
  <xsl:output method="json" indent="yes"/>

  <xsl:template match="/" name="xsl:initial-template">
    <xsl:apply-templates select="tail($html-doc//table)"/>
  </xsl:template>

  <xsl:template match="table">
    <xsl:sequence select="mf:category-map(tr[1]/td[1], (head(tr)/td[2], tail(tr)/td[1]))"/>
  </xsl:template>

  <xsl:template match="td">
    <xsl:sequence select="mf:category-map(., following-sibling::td[1]/ul)"/>
  </xsl:template>

  <xsl:template match="ul">
    <xsl:for-each-group select="*" group-starting-with="li">
      <xsl:sequence select="mf:category-map(., tail(current-group()))"/>
    </xsl:for-each-group>
  </xsl:template>

  <!-- returns two strings: the ID (first token) and the title (remaining tokens) -->
  <xsl:function name="mf:split-index-title" as="xs:string*">
    <xsl:param name="input" as="xs:string"/>
    <xsl:sequence
        select="
          let $components := tokenize(normalize-space($input))
          return
            (head($components), string-join(tail($components), ' '))"/>
  </xsl:function>

  <!-- builds a map keyed by the ID, holding the title and, only if there are
       subcategories, a children array -->
  <xsl:function name="mf:category-map" as="map(xs:string, item())">
    <xsl:param name="category" as="element()"/>
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:variable name="components" select="mf:split-index-title($category)"/>
    <xsl:map>
      <xsl:map-entry key="$components[1]">
        <xsl:map>
          <xsl:map-entry key="'title'" select="$components[2]"/>
          <xsl:if test="$subcategories">
            <xsl:map-entry key="'children'">
              <xsl:sequence select="array{ mf:child-categories($subcategories) }"/>
            </xsl:map-entry>
          </xsl:if>
        </xsl:map>
      </xsl:map-entry>
    </xsl:map>
  </xsl:function>

  <!-- recursion step: the templates above turn each subcategory into a map -->
  <xsl:function name="mf:child-categories" as="map(xs:string, item())*">
    <xsl:param name="subcategories" as="element()*"/>
    <xsl:apply-templates select="$subcategories"/>
  </xsl:function>

</xsl:stylesheet>


https://xsltfiddle.liberty-development.net/3NzcBud/4
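For comparison, here is a minimal sketch of the same target structure written as a literal XPath 3.1 map and array and serialized with method json. The two entries are hard-coded sample values taken from the question, not extracted from the HTML:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:output method="json" indent="yes"/>

  <xsl:template name="xsl:initial-template">
    <!-- one map per category: ID -> map with title and optional children array -->
    <xsl:sequence
        select="
          map {
            '900' : map {
              'title' : 'Geography &amp; history',
              'children' : array {
                map { '901' : map { 'title' : 'Philosophy &amp; theory' } }
              }
            }
          }"/>
  </xsl:template>

</xsl:stylesheet>
```

The full stylesheet above builds exactly this kind of value with xsl:map and xsl:map-entry, which is more convenient than a literal once the keys and the nesting are computed from the input. Note that the order of the entries in the serialized JSON object is not guaranteed to match the order in which they were written.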







edited Nov 23 at 8:38

























answered Nov 22 at 21:00









Martin Honnen

  • thank you very much
    – saloda
    Nov 23 at 10:47

















