lib(xml)


Note for ECLiPSe users

This code creates and accepts lists of character codes rather than ECLiPSe
strings. To convert between such lists and (UTF-8 or ASCII) strings, use the
ECLiPSe built-in string_list/3. For example, to parse a UTF-8 encoded
XML file, use the following code:

xml_parse_file(File, Document) :-
	open(File, read, Stream),
	read_string(Stream, end_of_file, _, Utf8String),
	close(Stream),
	string_list(Utf8String, Chars, utf8),
	xml_parse(Chars, Document).

This is Revision 2.0 of John Fletcher's code.
Most of the subsequent text is taken literally from

http://www.binding-time.co.uk/xmlpl.html.


TERMS AND CONDITIONS

This program is offered free of charge, as unsupported source code. You may
use it, copy it, distribute it, modify it or sell it without restriction,
but entirely at your own risk.

We hope that it will be useful to you, but it is provided "as is" without
any warranty express or implied, including but not limited to the warranty
of non-infringement and the implied warranties of merchantability and fitness
for a particular purpose.


History:
$Log: xml_comments.ecl,v $
Revision 1.4  2009/07/16 09:11:23  jschimpf
Merged patches_6_0 branch up to merge_2009_07_16

Revision 1.3.2.2  2009/04/09 02:11:38  jschimpf
Updated the url in documentation

Revision 1.3.2.1  2009/02/19 06:26:40  jschimpf
Added comment(categories,...) annotations for better documentation

Revision 1.3  2006/10/17 22:06:22  jschimpf
Reinserted lost licensing paragraph.

Revision 1.2  2006/10/17 22:02:21  jschimpf
Upgraded to John Fletcher's revision 2.0, released 2006/06/18,
available at http://www.zen37763.zen.co.uk/xml_download.html

Revision 1.1  2003/03/31 13:58:02  js10
Upgraded to latest version from John Fletcher's web site

Revision 1.2  2002/03/26 22:56:55  js10
Added John Fletcher's public domain XML parser/generator

Revision 1.1  2002/03/26 22:50:07  js10
Added John Fletcher's public domain XML parser/generator


Background

xml.pl is a module for parsing XML with Prolog, which provides Prolog
applications with a simple "Document Value Model" interface to XML
documents. It has been used successfully in a number of applications.

It supports a subset of XML suitable for XML Data and Worldwide Web
applications. It is neither as strict nor as comprehensive as the XML 1.0
Specification mandates.

It is not as strict because, while the specification must eliminate
ambiguities, not all errors need to be regarded as faults, and some
reasonable examples of real XML usage would have to be rejected if they
were.

It is not as comprehensive because, where the XML specification makes
provision for more or less complete DTDs to be provided as part of a
document, xml.pl acts on the local definition of ENTITIES only. Other DTD
extensions are treated as commentary.

The code, and a small Windows application which embodies it, have been
placed into the public domain, to encourage the use of Prolog with XML.

I hope that they will be useful to you, but they are not supported, and
they are provided without any warranty of any kind.

Specification

Three predicates are exported by the module: xml_parse/[2,3],
xml_subterm/2 and xml_pp/1.

xml_parse( {+Controls}, +?Chars, ?+Document ) parses Chars, a list of
character codes, to/from a data structure of the form
xml(<attributes>, <content>), where:
  
   
    
    <attributes>
        is a list of <name>=<char data> attributes from the (possibly
        implicit) XML signature of the document.

    <content>
        is a (possibly empty) list comprising occurrences of:

        pcdata(<char data>)
            Text.

        comment(<char data>)
            An XML comment.

        namespace(<URI>, <prefix>, <element>)
            A Namespace.

        element(<tag>, <attributes>, <content>)
            <tag>..</tag> encloses <content>, or <tag /> if empty.

        instructions(<name>, <char data>)
            A PI: <? <name> <char data> ?>

        cdata(<char data>)
            <![CDATA[<char data>]]>

        doctype(<tag>, <doctype id>)
            DTD: <!DOCTYPE .. >
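
As a minimal sketch of the in-bound direction under ECLiPSe, the following
wraps xml_parse/2 so that it accepts an ECLiPSe string, and picks the root
element out of the resulting xml/2 term. The helper names parse_string/2,
root_element/2 and node_element/2 are illustrative assumptions and are not
part of the library.

```prolog
:- lib(xml).
:- lib(lists).

% parse_string(+String, -Document)
% Hypothetical helper: convert an ECLiPSe string to a list of character
% codes and parse that list with xml_parse/2.
parse_string(String, Document) :-
    string_list(String, Chars, utf8),
    xml_parse(Chars, Document).

% root_element(+Document, -Element)
% Hypothetical helper: the first element/3 term in the top-level content
% of a well-formed document, looking inside a namespace/3 wrapper if
% there is one.
root_element(xml(_Attributes, Content), Element) :-
    once(( member(Node, Content), node_element(Node, Element) )).

node_element(element(Tag, Attributes, Content),
             element(Tag, Attributes, Content)).
node_element(namespace(_URI, _Prefix, Wrapped), Element) :-
    node_element(Wrapped, Element).
```

A query such as parse_string("<note id='n1'>Hello</note>", Doc) would be
expected to bind Doc to a term of the shape
xml(Attrs, [element(note, [id=IdChars], [pcdata(TextChars)])]), where
IdChars and TextChars are lists of character codes.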
   
  
The conversions are not completely symmetrical, in that weaker XML is
accepted than can be generated. Specifically, in-bound (Chars -> Document)
parsing does not require strictly well-formed XML. If Chars does not
represent well-formed XML, Document is instantiated to the term
malformed(<attributes>, <content>).

The <content> of a malformed/2 structure can include:

    unparsed(<char data>)
        Text which has not been parsed.

    out_of_context(<tag>)
        <tag> is not closed.

in addition to the parsed term types.
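
Because a failed parse is reported through the shape of the result rather
than by failure or an exception, callers usually need to inspect the
outermost functor. The sketch below shows one way to do that. The names
parse_checked/2, classify/2 and problem/1 are hypothetical, and only the
top level of the malformed content is inspected.

```prolog
:- lib(xml).
:- lib(lists).

% parse_checked(+Chars, -Result)
% Hypothetical wrapper: Result is ok(Document) for well-formed input, or
% error(Problems) listing the unparsed/1 and out_of_context/1 terms
% found at the top level of a malformed/2 result.
parse_checked(Chars, Result) :-
    xml_parse(Chars, Document),
    classify(Document, Result).

classify(xml(Attributes, Content), ok(xml(Attributes, Content))).
classify(malformed(_Attributes, Content), error(Problems)) :-
    findall(Problem,
            ( member(Problem, Content), problem(Problem) ),
            Problems).

% The two term types that occur only in malformed documents.
problem(unparsed(_)).
problem(out_of_context(_)).
```

Problem terms nested more deeply inside element/3 content are not
collected by this sketch; a recursive traversal would be needed for that.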

Out-bound (Document -> Chars) parsing does require that Document defines
well-formed XML. If an error is detected, a 'domain' exception is raised.

The domain exception will attempt to identify the particular sub-term in
error, and the message will show a list of its ancestor elements in the
form <tag>{(id)}*, where <id> is the value of any attribute named id.

At this release, the Controls applying to in-bound (Chars -> Document)
parsing are:

    extended_characters(<bool>)
        Use the extended character entities for XHTML (default true).

    format(<bool>)
        Remove layouts when no non-layout character data appears between
        elements (default true).

    remove_attribute_prefixes(<bool>)
        Remove redundant prefixes from attributes, i.e. prefixes denoting
        the namespace of the parent element (default false).

    allow_ampersand(<bool>)
        Allow unescaped ampersand characters (&) to occur in PCDATA
        (default false).
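
The controls are passed as a list in the first argument of xml_parse/3.
The following sketch, with the hypothetical name parse_lenient/2, keeps
layout between elements and tolerates bare ampersands; any control not
listed keeps the default given above.

```prolog
:- lib(xml).

% parse_lenient(+Chars, -Document)
% Hypothetical helper: parse Chars while preserving layout text between
% elements (format(false)) and accepting unescaped '&' characters in
% PCDATA (allow_ampersand(true)).
parse_lenient(Chars, Document) :-
    xml_parse([format(false), allow_ampersand(true)], Chars, Document).
```

This combination may suit hand-written or legacy input, where whitespace
is significant to the application and ampersands are not reliably escaped.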
  
For out-bound (Document -> Chars) parsing, the only available option is:

    format(<bool>)
        Indent the element content (default true).
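
As a sketch of the out-bound direction under ECLiPSe, the code below
builds a small document term and turns it into an ECLiPSe string without
indentation. The names greeting_document/2 and document_string/2 are
hypothetical. Note that all <char data> must be lists of character codes,
so ECLiPSe strings are converted with string_list before they are placed
in the term.

```prolog
:- lib(xml).

% greeting_document(+Text, -Document)
% Hypothetical helper: build a one-element document whose text content
% is the ECLiPSe string Text, converted to a list of character codes.
greeting_document(Text, xml([version=Version],
                            [element(greeting, [], [pcdata(Chars)])])) :-
    string_list("1.0", Version),
    string_list(Text, Chars, utf8).

% document_string(+Document, -String)
% Hypothetical helper: generate XML from Document without indentation
% and return it as a UTF-8 encoded ECLiPSe string.
document_string(Document, String) :-
    xml_parse([format(false)], Chars, Document),
    string_list(String, Chars, utf8).
```

Calling greeting_document("Hello", D), document_string(D, S) would then
bind S to the generated XML text. If D did not define well-formed XML,
xml_parse/3 would raise the 'domain' exception described above.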
  
Types

    <tag>
        An atom naming an element.

    <name>
        An atom, not naming an element.

    <URI>
        An atom giving the URI of a Namespace.

    <char data>
        A "string": a list of character codes.

    <doctype id>
        One of public(<char data>, <char data>),
        public(<char data>, <char data>, <dtd literals>),
        system(<char data>), system(<char data>, <dtd literals>),
        local or local(<dtd literals>).

    <dtd literals>
        A non-empty list of dtd_literal(<char data>) terms, e.g.
        attribute-list declarations.

    <bool>
        One of true or false.
  
  
xml_subterm( +XMLTerm, ?Subterm ) unifies Subterm with a sub-term of
XMLTerm. This can be especially useful when trying to test or retrieve a
deeply-nested subterm from a document, as demonstrated in the example
program linked from the original page. Note that XMLTerm is a sub-term of
itself.

xml_pp( +XMLDocument ) "pretty prints" XMLDocument on the current output
stream.
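
A short sketch of how xml_subterm/2 might be used to reach into a
document: the hypothetical predicate attribute_string/4 below finds an
element with a given tag at any depth and returns one of its attribute
values as an ECLiPSe string.

```prolog
:- lib(xml).
:- lib(lists).

% attribute_string(+Document, +Tag, +Name, -String)
% Hypothetical helper: String is the value of attribute Name on some
% element named Tag anywhere inside Document. Further solutions are
% found on backtracking if several such elements exist.
attribute_string(Document, Tag, Name, String) :-
    xml_subterm(Document, element(Tag, Attributes, _Content)),
    memberchk(Name=Chars, Attributes),
    string_list(String, Chars, utf8).
```

Because the second argument of xml_subterm/2 is only partially
instantiated, unification does the filtering: sub-terms that are not
element/3 terms with the requested tag simply fail to match. While
developing such queries, xml_pp(Document) can be used to print the term in
a readable layout.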

Availability

The module is available from the author's site (see the URL above), and is
supplied as a library with the following Prologs:

    - It is available in the ECLiPSe Constraint Programming System, as a
      third-party library.
    - It has been ported to B-Prolog by Neng-Fa Zhou.
    - It has been adapted for SICStus Prolog version 3.11+ by Mats
      Carlsson.
    - It is included in Quintus Prolog Release 3.5.
  
Features of xml.pl

The xml/2 data structure has some useful properties.

Reusability

Using an "abstract" Prolog representation of XML, in which terms represent
document "nodes", makes the parser reusable for any XML application.

In effect, xml.pl encapsulates the application-independent tasks of
document parsing and generation, which is essential where documents have
components from more than one Namespace.

Same Structure

The Prolog term representing a document has the same structure as the
document itself, which makes the correspondence between the literal
representation of the Prolog term and the XML source readily apparent.

For example, this simple SVG image:
     
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.0//EN" "http://www.w3.org/.../svg10.dtd"
    [
    <!ENTITY redblue "fill: red; stroke: blue; stroke-width: 1">
    ]>
<svg xmlns="http://www.w3.org/2000/svg" width="500" height="500">
 <circle cx=" 25 " cy=" 25 " r=" 24 " style="&redblue;"/>
</svg>
  
... translates into this Prolog term:
     
xml( [version="1.0", standalone="no"],
    [
    doctype( svg, public( "-//W3C//DTD SVG 1.0//EN", "http://www.w3.org/.../svg10.dtd" ) ),
    namespace( 'http://www.w3.org/2000/svg', "",
        element( svg,
            [width="500", height="500"],
            [
            element( circle,
                [cx="25", cy="25", r="24", style="fill: red; stroke: blue; stroke-width: 1"],
                [] )
            ] )
        )
    ] ).
  
Efficient Manipulation

Each type of node in an XML document is represented by a different Prolog
functor, while data (PCDATA, CDATA and Attribute Values) are left as
"strings" (lists of character codes).

The use of distinct functors for mark-up structures enables the efficient
recursive traversal of a document, while leaving the data as strings
facilitates application-specific parsing of data content (aka
Micro-parsing).

For example, to turn every CDATA node into a PCDATA node with tabs
expanded into spaces:
       
% cdata_to_pcdata( +Term1, ?Term2 ): Term2 is Term1 with every cdata/1 node
% replaced by a pcdata/1 node whose tabs are expanded into spaces.
% tab_expansion/2 is assumed to be supplied by the application.
cdata_to_pcdata( cdata(CharsWithTabs), pcdata(CharsWithSpaces) ) :-
    tab_expansion( CharsWithTabs, CharsWithSpaces ).
% Structural nodes: copy the node and recurse into its content.
cdata_to_pcdata( xml(Attributes, Content1), xml(Attributes, Content2) ) :-
    cdata_to_pcdata( Content1, Content2 ).
cdata_to_pcdata( namespace(URI,Prefix,Content1), namespace(URI,Prefix,Content2) ) :-
    cdata_to_pcdata( Content1, Content2 ).
cdata_to_pcdata( element(Name,Attrs,Content1), element(Name,Attrs,Content2) ) :-
    cdata_to_pcdata( Content1, Content2 ).
% Content lists: convert each member in turn.
cdata_to_pcdata( [], [] ).
cdata_to_pcdata( [H1|T1], [H2|T2] ) :-
    cdata_to_pcdata( H1, H2 ),
    cdata_to_pcdata( T1, T2 ).
% All other leaf nodes are copied unchanged.
cdata_to_pcdata( pcdata(Chars), pcdata(Chars) ).
cdata_to_pcdata( comment(Chars), comment(Chars) ).
cdata_to_pcdata( instructions(Name, Chars), instructions(Name, Chars) ).
cdata_to_pcdata( doctype(Tag, DoctypeId), doctype(Tag, DoctypeId) ).
   
  
The above uses no 'cuts', but will not create any choice points with
ground input.
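
The example relies on a tab_expansion/2 predicate that the text does not
define. One possible definition, given purely as an assumption for
illustration, replaces each tab with four spaces and ignores column
positions:

```prolog
% tab_expansion(+CharsWithTabs, -CharsWithSpaces)
% Hypothetical definition: every tab (character code 9) becomes four
% spaces (character code 32); all other codes are copied unchanged.
tab_expansion([], []).
tab_expansion([Char|Chars], Expanded) :-
    (   Char == 9
    ->  Expanded = [32, 32, 32, 32|Rest]
    ;   Expanded = [Char|Rest]
    ),
    tab_expansion(Chars, Rest).
```

Because the data is an ordinary list of character codes, this kind of
micro-parsing needs nothing beyond standard list processing. A
column-aware expansion would have to carry the current column as an extra
argument.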

Elegance

The resolution of entity references and the decomposition of the document
into distinct nodes mean that the calling application is not concerned
with the occasionally messy syntax of XML documents.

For example, the clean separation of namespace nodes means that
Namespaces, which are useful in combining specifications developed
separately, have similar usefulness in combining applications developed
separately.

The source code is available from the author's site. Although it is
unsupported, please feel free to e-mail queries and suggestions. I will
respond as time allows.


