xml-conduit-1.9.1.3: Pure-Haskell utilities for dealing with XML with the conduit package.
Safe HaskellSafe-Inferred
LanguageHaskell2010

Text.XML.Stream.Parse

Description

This module provides both a native Haskell solution for parsing XML documents into a stream of events, and a set of parser combinators for dealing with a stream of events.

As a simple example:

>>> :set -XOverloadedStrings
>>> import Conduit (runConduit, (.|))
>>> import Data.Text (Text, unpack)
>>> import Data.XML.Types (Event)
>>> data Person = Person Int Text Text deriving Show
>>> :{
let parsePerson :: MonadThrow m => ConduitT Event o m (Maybe Person)
    parsePerson = tag' "person" parseAttributes $ \(age, goodAtHaskell) -> do
      name <- content
      return $ Person (read $ unpack age) name goodAtHaskell
      where parseAttributes = (,) <$> requireAttr "age" <*> requireAttr "goodAtHaskell" <* ignoreAttrs
    parsePeople :: MonadThrow m => ConduitT Event o m (Maybe [Person])
    parsePeople = tagNoAttr "people" $ many parsePerson
    inputXml = mconcat
      [ "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
      , "<people>"
      , "  <person age=\"25\" goodAtHaskell=\"yes\">Michael</person>"
      , "  <person age=\"2\" goodAtHaskell=\"might become\">Eliezer</person>"
      , "</people>"
      ]
:}
>>> runConduit $ parseLBS def inputXml .| force "people required" parsePeople
[Person 25 "Michael" "yes",Person 2 "Eliezer" "might become"]

This module also supports streaming results using yield. This allows parser results to be processed using conduits while a particular parser (e.g. many) is still running. Without using streaming results, you have to wait until the parser finished before you can process the result list. Large XML files might be easier to process by using streaming results. See http://stackoverflow.com/q/21367423/2597135 for a related discussion.

>>> import Data.Conduit.List as CL
>>> :{
let parsePeople' :: MonadThrow m => ConduitT Event Person m (Maybe ())
    parsePeople' = tagNoAttr "people" $ manyYield parsePerson
:}
>>> runConduit $ parseLBS def inputXml .| force "people required" parsePeople' .| CL.mapM_ print
Person 25 "Michael" "yes"
Person 2 "Eliezer" "might become"

Previous versions of this module contained a number of more sophisticated functions written by Aristid Breitkreuz and Dmitry Olshansky. To keep this package simpler, those functions are being moved to a separate package. This note will be updated with the name of the package(s) when available.

Synopsis

Parsing XML files

parseBytes :: MonadThrow m => ParseSettings -> ConduitT ByteString Event m () Source #

Parses a byte stream into Events. This function is implemented fully in Haskell using attoparsec-text for parsing. The produced error messages do not give line/column information, so you may prefer to stick with the parser provided by libxml-enumerator. However, this has the advantage of not relying on any C libraries.

This relies on detectUtf to determine character encoding, and parseText to do the actual parsing.

parseText :: MonadThrow m => ParseSettings -> ConduitT Text Event m () Source #

Parses a character stream into Events. This function is implemented fully in Haskell using attoparsec-text for parsing. The produced error messages do not give line/column information, so you may prefer to stick with the parser provided by libxml-enumerator. However, this has the advantage of not relying on any C libraries.

Since 1.2.4

parseTextPos :: MonadThrow m => ParseSettings -> ConduitT Text EventPos m () Source #

Same as parseText, but includes the position of each event.

Since 1.2.4

detectUtf :: MonadThrow m => ConduitT ByteString Text m () Source #

Automatically determine which UTF variant is being used. This function first checks for BOMs, removing them as necessary, and then check for the equivalent of <?xml for each of UTF-8, UTF-16LEBE, and UTF-32LEBE. It defaults to assuming UTF-8.

parseFile :: MonadResource m => ParseSettings -> FilePath -> ConduitT i Event m () Source #

A helper function which reads a file from disk using enumFile, detects character encoding using detectUtf, parses the XML using parseBytes, and then hands off control to your supplied parser.

parseLBS :: MonadThrow m => ParseSettings -> ByteString -> ConduitT i Event m () Source #

Parse an event stream from a lazy ByteString.

Parser settings

data ParseSettings Source #

Instances

Instances details
Default ParseSettings Source # 
Instance details

Defined in Text.XML.Stream.Parse

def :: Default a => a Source #

The default value for this type.

psDecodeIllegalCharacters :: ParseSettings -> DecodeIllegalCharacters Source #

How to decode illegal character references (&#[0-9]+; or &#x[0-9a-fA-F]+;).

Character references within the legal ranges defined by the standard are automatically parsed. Others are passed to this function.

Default: const Nothing

Since 1.7.1

psRetainNamespaces :: ParseSettings -> Bool Source #

Whether the original xmlns attributes should be retained in the parsed values. For more information on motivation, see:

https://github.com/snoyberg/xml/issues/38

Default: False

Since 1.2.1

psEntityExpansionSizeLimit :: ParseSettings -> Int Source #

Maximum number of characters allowed in expanding an internal entity. This is intended to protect against the billion laughs attack.

Default: 8192

Since 1.9.1

psIgnoreInternalEntityDeclarations :: ParseSettings -> Bool Source #

Whether to resolve any but the predefined entities.

Default: False

Entity decoding

decodeXmlEntities :: DecodeEntities Source #

Default implementation of DecodeEntities, which leaves the entity as-is. Numeric character references and the five standard entities (lt, gt, amp, quot, pos) are handled internally by the parser.

decodeHtmlEntities :: DecodeEntities Source #

HTML4-compliant entity decoder. Handles the additional 248 entities defined by HTML 4 and XHTML 1.

Note that HTML 5 introduces a drastically larger number of entities, and this code does not recognize most of them.

Event parsing

tag Source #

Arguments

:: MonadThrow m 
=> NameMatcher a

Check if this is a correct tag name and return a value that can be used to get an AttrParser. If this fails, the function will return Nothing

-> (a -> AttrParser b)

Given the value returned by the name checker, this function will be used to get an AttrParser appropriate for the specific tag. If the AttrParser fails, the function will also return Nothing

-> (b -> ConduitT Event o m c)

Handler function to handle the attributes and children of a tag, given the value return from the AttrParser

-> ConduitT Event o m (Maybe c) 

The most generic way to parse a tag. It takes a NameMatcher to check whether this is a correct tag name, an AttrParser to handle attributes, and then a parser to deal with content.

Events are consumed if and only if the tag name and its attributes match.

This function automatically absorbs its balancing closing tag, and will throw an exception if not all of the attributes or child elements are consumed. If you want to allow extra attributes, see ignoreAttrs.

This function automatically ignores comments, instructions and whitespace.

tag' :: MonadThrow m => NameMatcher a -> AttrParser b -> (b -> ConduitT Event o m c) -> ConduitT Event o m (Maybe c) Source #

A simplified version of tag where the NameMatcher result isn't forwarded to the attributes parser.

Since 1.5.0

tagNoAttr Source #

Arguments

:: MonadThrow m 
=> NameMatcher a

Check if this is a correct tag name

-> ConduitT Event o m b

Handler function to handle the children of the matched tag

-> ConduitT Event o m (Maybe b) 

A further simplified tag parser, which requires that no attributes exist.

tagIgnoreAttrs Source #

Arguments

:: MonadThrow m 
=> NameMatcher a

Check if this is a correct tag name

-> ConduitT Event o m b

Handler function to handle the children of the matched tag

-> ConduitT Event o m (Maybe b) 

A further simplified tag parser, which ignores all attributes, if any exist

content :: MonadThrow m => ConduitT Event o m Text Source #

Grabs the next piece of content. If none if available, returns empty. This is simply a wrapper around contentMaybe.

contentMaybe :: MonadThrow m => ConduitT Event o m (Maybe Text) Source #

Grabs the next piece of content if available. This function skips over any comments, instructions or entities, and concatenates all content until the next start or end tag.

Ignoring tags/trees

ignoreEmptyTag Source #

Arguments

:: MonadThrow m 
=> NameMatcher a

Check if this is a correct tag name

-> ConduitT Event o m (Maybe ()) 

Ignore an empty tag and all of its attributes. This does not ignore the tag recursively (i.e. it assumes there are no child elements). This function returns Just () if the tag matched.

Since 1.5.0

ignoreTree :: MonadThrow m => NameMatcher a -> AttrParser b -> ConduitT Event o m (Maybe ()) Source #

Same as takeTree, without yielding Events.

>>> runConduit $ parseLBS def "<a>content</a><b></b>" .| (ignoreTree "a" ignoreAttrs >> sinkList)
[EventBeginElement (Name {nameLocalName = "b", ...}) [],EventEndElement (Name {nameLocalName = "b", ...}),EventEndDocument]
>>> runConduit $ parseLBS def "<a>content</a>" .| (ignoreTree "b" ignoreAttrs >> sinkList)
[EventBeginElement (Name {nameLocalName = "a", ...}) [],EventContent (ContentText "content"),EventEndElement (Name {nameLocalName = "a", ...}),EventEndDocument]
>>> runConduit $ parseLBS def "content<a></a>" .| (ignoreTree anyName ignoreAttrs >> sinkList)
[EventContent (ContentText "content"),EventBeginElement (Name {nameLocalName = "a", ...}) [],EventEndElement (Name {nameLocalName = "a", ...}),EventEndDocument]

Since 1.9.0

ignoreContent :: MonadThrow m => ConduitT Event o m (Maybe ()) Source #

Same as takeContent, without yielding Events.

>>> runConduit $ parseLBS def "<a>content</a>" .| (ignoreContent >> sinkList)
[EventBeginElement (Name {nameLocalName = "a", ...}) [],EventContent (ContentText "content"),EventEndElement (Name {nameLocalName = "a", ...}),EventEndDocument]
>>> runConduit $ parseLBS def "content<a></a>" .| (ignoreContent >> sinkList)
[EventBeginElement (Name {nameLocalName = "a", ...}) [],EventEndElement (Name {nameLocalName = "a", ...}),EventEndDocument]
>>> runConduit $ parseLBS def "content<a></a>" .| (ignoreContent >> sinkList)
[EventBeginElement (Name {nameLocalName = "a", ...}) [],EventEndElement (Name {nameLocalName = "a", ...}),EventEndDocument]

Since 1.9.0

ignoreTreeContent :: MonadThrow m => NameMatcher a -> AttrParser b -> ConduitT Event o m (Maybe ()) Source #

Same as takeTreeContent, without yielding Events.

>>> runConduit $ parseLBS def "<a>content</a><b></b>" .| (ignoreTreeContent "a" ignoreAttrs >> sinkList)
[EventBeginElement (Name {nameLocalName = "b", ...}) [],EventEndElement (Name {nameLocalName = "b", ...}),EventEndDocument]
>>> runConduit $ parseLBS def "<a>content</a>" .| (ignoreTreeContent "b" ignoreAttrs >> sinkList)
[EventBeginElement (Name {nameLocalName = "a", ...}) [],EventContent (ContentText "content"),EventEndElement (Name {nameLocalName = "a", ...}),EventEndDocument]
>>> runConduit $ parseLBS def "content<a></a>" .| (ignoreTreeContent anyName ignoreAttrs >> sinkList)
[EventBeginElement (Name {nameLocalName = "a", ...}) [],EventEndElement (Name {nameLocalName = "a", ...}),EventEndDocument]

Since 1.5.0

ignoreAnyTreeContent :: MonadThrow m => ConduitT Event o m (Maybe ()) Source #

Same as takeAnyTreeContent, without yielding Events.

>>> runConduit $ parseLBS def "<a>content</a><b></b>" .| (ignoreAnyTreeContent >> sinkList)
[EventBeginElement (Name {nameLocalName = "b", ...}) [],EventEndElement (Name {nameLocalName = "b", ...}),EventEndDocument]
>>> runConduit $ parseLBS def "text<b></b>" .| (ignoreAnyTreeContent >> sinkList)
[EventBeginElement (Name {nameLocalName = "b", ...}) [],EventEndElement (Name {nameLocalName = "b", ...}),EventEndDocument]

Since 1.5.0

Streaming events

takeContent :: MonadThrow m => ConduitT Event Event m (Maybe ()) Source #

Stream a single content Event.

Returns Just () if a content Event was consumed, Nothing otherwise.

>>> runConduit $ parseLBS def "content<a></a>" .| void takeContent .| sinkList
[EventBeginDocument,EventContent (ContentText "content")]

If next event isn't a content, nothing is consumed.

>>> runConduit $ parseLBS def "<a>content</a>" .| void takeContent .| sinkList
[EventBeginDocument]

Since 1.5.0

takeTree :: MonadThrow m => NameMatcher a -> AttrParser b -> ConduitT Event Event m (Maybe ()) Source #

Stream Events corresponding to a single XML element that matches given NameMatcher and AttrParser, from the opening- to the closing-tag.

>>> runConduit $ parseLBS def "<a>content</a><b></b>" .| void (takeTree "a" ignoreAttrs) .| sinkList
[EventBeginDocument,EventBeginElement (Name {nameLocalName = "a", ...}) [],EventContent (ContentText "content"),EventEndElement (Name {nameLocalName = "a", ...})]
>>> runConduit $ parseLBS def "<a>content</a><b></b>" .| void (takeTree "b" ignoreAttrs) .| sinkList
[EventBeginDocument]

If next Event isn't an element, nothing is consumed.

>>> runConduit $ parseLBS def "text<a></a>" .| void (takeTree "a" ignoreAttrs) .| sinkList
[EventBeginDocument]

If an opening-tag is consumed but no matching closing-tag is found, an XmlException is thrown.

>>> runConduit $ parseLBS def "<a><b></b>" .| void (takeTree "a" ignoreAttrs) .| sinkList
*** Exception: InvalidEndElement (Name {nameLocalName = "a", nameNamespace = Nothing, namePrefix = Nothing}) Nothing

This function automatically ignores comments, instructions and whitespace.

Returns Just () if an element was consumed, Nothing otherwise.

Since 1.5.0

takeTreeContent :: MonadThrow m => NameMatcher a -> AttrParser b -> ConduitT Event Event m (Maybe ()) Source #

Like takeTree, but can also stream a content Event.

>>> runConduit $ parseLBS def "<a>content</a><b></b>" .| void (takeTreeContent "a" ignoreAttrs) .| sinkList
[EventBeginDocument,EventBeginElement (Name {nameLocalName = "a", ...}) [],EventContent (ContentText "content"),EventEndElement (Name {nameLocalName = "a", ...})]
>>> runConduit $ parseLBS def "<a>content</a><b></b>" .| void (takeTreeContent "b" ignoreAttrs) .| sinkList
[EventBeginDocument]
>>> runConduit $ parseLBS def "content<a></a><b></b>" .| void (takeTreeContent "a" ignoreAttrs) .| sinkList
[EventBeginDocument,EventContent (ContentText "content")]

Since 1.5.0

takeAnyTreeContent :: MonadThrow m => ConduitT Event Event m (Maybe ()) Source #

Like takeTreeContent, without checking for tag name or attributes.

>>> runConduit $ parseLBS def "text<a></a>" .| void takeAnyTreeContent .| sinkList
[EventBeginDocument,EventContent (ContentText "text")]
>>> runConduit $ parseLBS def "</a><b></b>" .| void takeAnyTreeContent .| sinkList
[EventBeginDocument]
>>> runConduit $ parseLBS def "<b><c></c></b></a>text" .| void takeAnyTreeContent .| sinkList
[EventBeginDocument,EventBeginElement (Name {nameLocalName = "b", ...}) [],EventBeginElement (Name {nameLocalName = "c", ...}) [],EventEndElement (Name {nameLocalName = "c", ...}),EventEndElement (Name {nameLocalName = "b", ...})]

Since 1.5.0

Tag name matching

newtype NameMatcher a Source #

A NameMatcher describes which names a tag parser is allowed to match.

Since 1.5.0

Constructors

NameMatcher 

Fields

Instances

Instances details
Alternative NameMatcher Source #

NameMatchers can be combined with <|>

Instance details

Defined in Text.XML.Stream.Parse

Applicative NameMatcher Source # 
Instance details

Defined in Text.XML.Stream.Parse

Methods

pure :: a -> NameMatcher a #

(<*>) :: NameMatcher (a -> b) -> NameMatcher a -> NameMatcher b #

liftA2 :: (a -> b -> c) -> NameMatcher a -> NameMatcher b -> NameMatcher c #

(*>) :: NameMatcher a -> NameMatcher b -> NameMatcher b #

(<*) :: NameMatcher a -> NameMatcher b -> NameMatcher a #

Functor NameMatcher Source # 
Instance details

Defined in Text.XML.Stream.Parse

Methods

fmap :: (a -> b) -> NameMatcher a -> NameMatcher b #

(<$) :: a -> NameMatcher b -> NameMatcher a #

a ~ Name => IsString (NameMatcher a) Source #

Match a single Name in a concise way. Note that Name is namespace sensitive: when using the IsString instance, use "{http://a/b}c" to match the tag c in the XML namespace http://a/b

Instance details

Defined in Text.XML.Stream.Parse

matching :: (Name -> Bool) -> NameMatcher Name Source #

matching f matches name iff f name is true. Returns the matched Name.

Since 1.5.0

anyOf :: [Name] -> NameMatcher Name Source #

Matches any Name from the given list. Returns the matched Name.

Since 1.5.0

anyName :: NameMatcher Name Source #

Matches any Name. Returns the matched Name.

Since 1.5.0

Attribute parsing

data AttrParser a Source #

A monad for parsing attributes. By default, it requires you to deal with all attributes present on an element, and will throw an exception if there are unhandled attributes. Use the requireAttr, attr et al functions for handling an attribute, and ignoreAttrs if you would like to skip the rest of the attributes on an element.

Alternative instance behaves like First monoid: it chooses first parser which doesn't fail.

Instances

Instances details
Alternative AttrParser Source # 
Instance details

Defined in Text.XML.Stream.Parse

Applicative AttrParser Source # 
Instance details

Defined in Text.XML.Stream.Parse

Methods

pure :: a -> AttrParser a #

(<*>) :: AttrParser (a -> b) -> AttrParser a -> AttrParser b #

liftA2 :: (a -> b -> c) -> AttrParser a -> AttrParser b -> AttrParser c #

(*>) :: AttrParser a -> AttrParser b -> AttrParser b #

(<*) :: AttrParser a -> AttrParser b -> AttrParser a #

Functor AttrParser Source # 
Instance details

Defined in Text.XML.Stream.Parse

Methods

fmap :: (a -> b) -> AttrParser a -> AttrParser b #

(<$) :: a -> AttrParser b -> AttrParser a #

Monad AttrParser Source # 
Instance details

Defined in Text.XML.Stream.Parse

Methods

(>>=) :: AttrParser a -> (a -> AttrParser b) -> AttrParser b #

(>>) :: AttrParser a -> AttrParser b -> AttrParser b #

return :: a -> AttrParser a #

MonadThrow AttrParser Source # 
Instance details

Defined in Text.XML.Stream.Parse

Methods

throwM :: (HasCallStack, Exception e) => e -> AttrParser a #

attr :: Name -> AttrParser (Maybe Text) Source #

Return the value for an attribute if present.

requireAttr :: Name -> AttrParser Text Source #

Shortcut composition of force and attr.

optionalAttr :: Name -> AttrParser (Maybe Text) Source #

Deprecated: Please use attr.

ignoreAttrs :: AttrParser () Source #

Skip the remaining attributes on an element. Since this will clear the list of attributes, you must call this after any calls to requireAttr, optionalAttr, etc.

Combinators

orE Source #

Arguments

:: Monad m 
=> ConduitT Event o m (Maybe a)

The first (preferred) parser

-> ConduitT Event o m (Maybe a)

The second parser, only executed if the first parser fails

-> ConduitT Event o m (Maybe a) 

Get the value of the first parser which returns Just. If no parsers succeed (i.e., return Just), this function returns Nothing.

orE a b = choose [a, b]

Warning: orE doesn't backtrack. See choose for detailed explanation.

choose Source #

Arguments

:: Monad m 
=> [ConduitT Event o m (Maybe a)]

List of parsers that will be tried in order.

-> ConduitT Event o m (Maybe a)

Result of the first parser to succeed, or Nothing if no parser succeeded

Get the value of the first parser which returns Just. If no parsers succeed (i.e., return Just), this function returns Nothing.

Warning: choose doesn't backtrack. If a parser consumed some events, subsequent parsers will continue from the following events. This can be a problem if parsers share an accepted prefix of events, so an earlier (failing) parser will discard the events that the later parser could potentially succeed on.

An other problematic case is using choose to implement order-independent parsing using a set of parsers, with a final trailing ignore-anything-else action. In this case, certain trees might be skipped.

>>> :{
let parse2Tags name1 name2 = do
      tag1 <- tagNoAttr name1 (pure ())
      tag2 <- tagNoAttr name2 (pure tag1)
      return $ join tag2
:}
>>> :{
runConduit $ parseLBS def "<a></a><b></b>" .| choose
  [ parse2Tags "a" "b"
  , parse2Tags "a" "c"
  ]
:}
Just ()
>>> :{
runConduit $ parseLBS def "<a></a><b></b>" .| choose
  [ parse2Tags "a" "c"
  , parse2Tags "a" "b"
  ]
:}
Nothing

many :: Monad m => ConduitT Event o m (Maybe a) -> ConduitT Event o m [a] Source #

Keep parsing elements as long as the parser returns Just.

many_ :: MonadThrow m => ConduitT Event o m (Maybe a) -> ConduitT Event o m () Source #

Like many but discards the results without building an intermediate list.

Since 1.5.0

manyIgnore :: Monad m => ConduitT Event o m (Maybe a) -> ConduitT Event o m (Maybe b) -> ConduitT Event o m [a] Source #

Keep parsing elements as long as the parser returns Just or the ignore parser returns Just.

many' :: MonadThrow m => ConduitT Event o m (Maybe a) -> ConduitT Event o m [a] Source #

Like many, but any tags and content the consumer doesn't match on are silently ignored.

force Source #

Arguments

:: MonadThrow m 
=> String

Error message

-> m (Maybe a)

Optional parser to be forced

-> m a 

Force an optional parser into a required parser. All of the tag functions, attr, choose and many deal with Maybe parsers. Use this when you want to finally force something to happen.

Streaming combinators

manyYield :: Monad m => ConduitT a b m (Maybe b) -> ConduitT a b m () Source #

Like many, but uses yield so the result list can be streamed to downstream conduits without waiting for manyYield to finish

manyYield' :: MonadThrow m => ConduitT Event b m (Maybe b) -> ConduitT Event b m () Source #

Like many', but uses yield so the result list can be streamed to downstream conduits without waiting for manyYield' to finish

manyIgnoreYield Source #

Arguments

:: MonadThrow m 
=> ConduitT Event b m (Maybe b)

Consuming parser that generates the result stream

-> ConduitT Event b m (Maybe ())

Ignore parser that consumes elements to be ignored

-> ConduitT Event b m () 

Like manyIgnore, but uses yield so the result list can be streamed to downstream conduits without waiting for manyIgnoreYield to finish

Exceptions

Other types