1/* Part of SWI-Prolog 2 3 Author: Jan Wielemaker 4 E-mail: J.Wielemaker@vu.nl 5 WWW: http://www.swi-prolog.org 6 Copyright (c) 2009-2016, University of Amsterdam 7 VU University Amsterdam 8 All rights reserved. 9 10 Redistribution and use in source and binary forms, with or without 11 modification, are permitted provided that the following conditions 12 are met: 13 14 1. Redistributions of source code must retain the above copyright 15 notice, this list of conditions and the following disclaimer. 16 17 2. Redistributions in binary form must reproduce the above copyright 18 notice, this list of conditions and the following disclaimer in 19 the documentation and/or other materials provided with the 20 distribution. 21 22 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 23 "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 24 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 25 FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE 26 COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 27 INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 28 BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 29 LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 30 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 31 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN 32 ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 33 POSSIBILITY OF SUCH DAMAGE. 34*/ 35 36:- module(xpath, 37 [ xpath/3, % +DOM, +Spec, -Value 38 xpath_chk/3, % +DOM, +Spec, -Value 39 40 op(400, fx, //), 41 op(400, fx, /), 42 op(200, fy, @) 43 ]). 44:- use_module(library(record)). 45:- use_module(library(lists)). 46:- use_module(library(debug)). 47:- use_module(library(error)). 48:- use_module(library(sgml)).
65:- record
66 element(name, attributes, content).
72xpath_chk(DOM, Spec, Content) :-
73 xpath(DOM, Spec, Content),
74 !.
//
Term/
Term
The Terms above are of type callable. The functor specifies
the element name. The element name '*' refers to any element.
The name self
refers to the top-element itself and is often
used for processing matches of an earlier xpath/3 query. A term
NS:Term refers to an XML name in the namespace NS. Optional
arguments specify additional constraints and functions. The
arguments are processed from left to right. Defined conditional
argument values are:
last
last
- IntExprlast-1
is the element directly preceding the last one.index(Integer)
.last
index(last)
.last
- IntExprindex(last-IntExpr)
.Defined function argument values are:
self
content
text
normalize_space
text
, but uses normalize_space/2 to normalise
white-space in the outputnumber
@
Attributelibrary(sgml)
.number
, but subsequently transform the value
into an integer using the round/1 function.number
, but subsequently transform the value
into a float using the float/1 function.In addition, the argument-list can be conditions:
content = content
defines that the content
of the element is the atom content
.
The functions lower_case
and upper_case
can be applied
to Right (see example below).contains(Haystack, Needle)
h3
element inside a div
element, where the div
element itself contains an h2
child with a strong
child.
//div(h2/strong)/h3
This is equivalent to the conjunction of XPath goals below.
..., xpath(DOM, //(div), Div), xpath(Div, h2/strong, _), xpath(Div, h3, Result)
Examples:
Match each table-row in DOM:
xpath(DOM, //tr, TR)
Match the last cell of each tablerow in DOM. This example illustrates that a result can be the input of subsequent xpath/3 queries. Using multiple queries on the intermediate TR term guarantee that all results come from the same table-row:
xpath(DOM, //tr, TR), xpath(TR, /td(last), TD)
Match each href
attribute in an <a> element
xpath(DOM, //a(@href), HREF)
Suppose we have a table containing rows where each first column is the name of a product with a link to details and the second is the price (a number). The following predicate matches the name, URL and price:
product(DOM, Name, URL, Price) :- xpath(DOM, //tr, TR), xpath(TR, td(1), C1), xpath(C1, /self(normalize_space), Name), xpath(C1, a(@href), URL), xpath(TR, td(2, number), Price).
Suppose we want to select books with genre="thriller" from a
tree containing elements <book genre=...>
thriller(DOM, Book) :- xpath(DOM, //book(@genre=thiller), Book).
Match the elements <table align="center">
and <table
align="CENTER">
:
//table(@align(lower) = center)
Get the width
and height
of a div
element as a number,
and the div
node itself:
xpath(DOM, //div(@width(number)=W, @height(number)=H), Div)
Note that div
is an infix operator, so parentheses must be
used in cases like the following:
xpath(DOM, //(div), Div)
257xpath(DOM, Spec, Content) :- 258 in_dom(Spec, DOM, Content). 259 260in_dom(//Spec, DOM, Value) :- 261 !, 262 element_spec(Spec, Name, Modifiers), 263 sub_dom(I, Len, Name, E, DOM), 264 modifiers(Modifiers, I, Len, E, Value). 265in_dom(/Spec, E, Value) :- 266 !, 267 element_spec(Spec, Name, Modifiers), 268 ( Name == self 269 -> true 270 ; element_name(E, Name) 271 ), 272 modifiers(Modifiers, 1, 1, E, Value). 273in_dom(A/B, DOM, Value) :- 274 !, 275 in_dom(A, DOM, Value0), 276 in_dom(B, Value0, Value). 277in_dom(A//B, DOM, Value) :- 278 !, 279 in_dom(A, DOM, Value0), 280 in_dom(//B, Value0, Value). 281in_dom(Spec, element(_, _, Content), Value) :- 282 element_spec(Spec, Name, Modifiers), 283 count_named_elements(Content, Name, CLen), 284 CLen > 0, 285 nth_element(N, Name, E, Content), 286 modifiers(Modifiers, N, CLen, E, Value). 287 288element_spec(Var, _, _) :- 289 var(Var), 290 !, 291 instantiation_error(Var). 292element_spec(NS:Term, NS:Name, Modifiers) :- 293 !, 294 callable_name_arguments(Term, Name0, Modifiers), 295 star(Name0, Name). 296element_spec(Term, Name, Modifiers) :- 297 !, 298 callable_name_arguments(Term, Name0, Modifiers), 299 star(Name0, Name). 300 301callable_name_arguments(Atom, Name, Arguments) :- 302 atom(Atom), 303 !, 304 Name = Atom, Arguments = []. 305callable_name_arguments(Compound, Name, Arguments) :- 306 compound_name_arguments(Compound, Name, Arguments). 307 308 309star(*, _) :- !. 310star(Name, Name).
322sub_dom(1, 1, Name, DOM, DOM) :- 323 element_name(DOM, Name). 324sub_dom(N, Len, Name, E, element(_,_,Content)) :- 325 !, 326 sub_dom_2(N, Len, Name, E, Content). 327sub_dom(N, Len, Name, E, Content) :- 328 is_list(Content), 329 sub_dom_2(N, Len, Name, E, Content). 330 331sub_dom_2(N, Len, Name, Element, Content) :- 332 ( count_named_elements(Content, Name, Len), 333 nth_element(N, Name, Element, Content) 334 ; member(element(_,_,C2), Content), 335 sub_dom_2(N, Len, Name, Element, C2) 336 ).
343count_named_elements(Content, Name, Count) :- 344 count_named_elements(Content, Name, 0, Count). 345 346count_named_elements([], _, Count, Count). 347count_named_elements([element(Name,_,_)|T], Name0, C0, C) :- 348 \+ Name \= Name0, 349 !, 350 C1 is C0+1, 351 count_named_elements(T, Name0, C1, C). 352count_named_elements([_|T], Name, C0, C) :- 353 count_named_elements(T, Name, C0, C).
360nth_element(N, Name, Element, Content) :- 361 nth_element_(1, N, Name, Element, Content). 362 363nth_element_(I, N, Name, E, [H|T]) :- 364 element_name(H, Name), 365 !, 366 ( N = I, 367 E = H 368 ; I2 is I + 1, 369 ( nonvar(N), I2 > N 370 -> !, fail 371 ; true 372 ), 373 nth_element_(I2, N, Name, E, T) 374 ). 375nth_element_(I, N, Name, E, [_|T]) :- 376 nth_element_(I, N, Name, E, T).
383modifiers([], _, _, Value, Value). 384modifiers([H|T], I, L, Value0, Value) :- 385 modifier(H, I, L, Value0, Value1), 386 modifiers(T, I, L, Value1, Value). 387 388modifier(M, _, _, _, _) :- 389 var(M), 390 !, 391 instantiation_error(M). 392modifier(Index, I, L, Value0, Value) :- 393 implicit_index_modifier(Index), 394 !, 395 Value = Value0, 396 index_modifier(Index, I, L). 397modifier(index(Index), I, L, Value, Value) :- 398 !, 399 index_modifier(Index, I, L). 400modifier(Function, _, _, In, Out) :- 401 xpath_function(Function), 402 !, 403 xpath_function(Function, In, Out). 404modifier(Function, _, _, In, Out) :- 405 xpath_condition(Function, In), 406 Out = In. 407 408implicit_index_modifier(I) :- 409 integer(I), 410 !. 411implicit_index_modifier(last). 412implicit_index_modifier(last-_Expr). 413 414index_modifier(Var, I, _L) :- 415 var(Var), 416 !, 417 Var = I. 418index_modifier(last, I, L) :- 419 !, 420 I =:= L. 421index_modifier(last-Expr, I, L) :- 422 !, 423 I =:= L-Expr. 424index_modifier(N, I, _) :- 425 N =:= I. 426 427xpath_function(self, DOM, Value) :- % self 428 !, 429 Value = DOM. 430xpath_function(content, Element, Value) :- % content 431 !, 432 element_content(Element, Value). 433xpath_function(text, DOM, Text) :- % text 434 !, 435 text_of_dom(DOM, Text). 436xpath_function(normalize_space, DOM, Text) :- % normalize_space 437 !, 438 text_of_dom(DOM, Text0), 439 normalize_space(atom(Text), Text0). 440xpath_function(number, DOM, Number) :- % number 441 !, 442 text_of_dom(DOM, Text0), 443 normalize_space(string(Text), Text0), 444 catch(xsd_number_string(Number, Text), _, fail). 445xpath_function(@Name, element(_, Attrs, _), Value) :- % @Name 446 !, 447 ( atom(Name) 448 -> memberchk(Name=Value, Attrs) 449 ; compound(Name) 450 -> compound_name_arguments(Name, AName, AOps), 451 memberchk(AName=Value0, Attrs), 452 translate_attribute(AOps, Value0, Value) 453 ; member(Name=Value, Attrs) 454 ). 455xpath_function(quote(Value), _, Value). % quote(Value) 456 457xpath_function(self). 458xpath_function(content). 459xpath_function(text). 460xpath_function(normalize_space). 461xpath_function(number). 462xpath_function(@_). 463xpath_function(quote(_)). 464 465translate_attribute([], Value, Value). 466translate_attribute([H|T], Value0, Value) :- 467 translate_attr(H, Value0, Value1), 468 translate_attribute(T, Value1, Value). 469 470translate_attr(number, Value0, Value) :- 471 xsd_number_string(Value, Value0). 472translate_attr(integer, Value0, Value) :- 473 xsd_number_string(Value1, Value0), 474 Value = round(Value1). 475translate_attr(float, Value0, Value) :- 476 xsd_number_string(Value1, Value0), 477 Value = float(Value1). 478translate_attr(string, Value0, Value) :- 479 atom_string(Value0, Value). 480translate_attr(lower, Value0, Value) :- 481 ( atom(Value0) 482 -> downcase_atom(Value0, Value) 483 ; string_lower(Value0, Value) 484 ). 485translate_attr(upper, Value0, Value) :- 486 ( atom(Value0) 487 -> upcase_atom(Value0, Value) 488 ; string_upper(Value0, Value) 489 ). 490 491xpath_condition(Left = Right, Value) :- % = 492 !, 493 var_or_function(Left, Value, LeftValue), 494 process_equality(LeftValue, Right). 495xpath_condition(contains(Haystack, Needle), Value) :- % contains(Haystack, Needle) 496 !, 497 val_or_function(Haystack, Value, HaystackValue), 498 val_or_function(Needle, Value, NeedleValue), 499 atom(HaystackValue), atom(NeedleValue), 500 ( sub_atom(HaystackValue, _, _, _, NeedleValue) 501 -> true 502 ). 503xpath_condition(Spec, Dom) :- 504 in_dom(Spec, Dom, _).
For example the XPath expression in [1], and the equivalent Prolog expression in [2], would both match the HTML element in [3].
[1] //table[align=lower-case(center)] [2] //table(@align=lower_case(center)) [3] <table align="CENTER">
522process_equality(Left, Right) :- 523 var(Right), 524 !, 525 Left = Right. 526process_equality(Left, lower_case(Right)) :- 527 !, 528 downcase_atom(Left, Right). 529process_equality(Left, upper_case(Right)) :- 530 !, 531 upcase_atom(Left, Right). 532process_equality(Left, Right) :- 533 Left = Right. 534 535 536var_or_function(Arg, _, Arg) :- 537 var(Arg), 538 !. 539var_or_function(Func, Value0, Value) :- 540 xpath_function(Func), 541 !, 542 xpath_function(Func, Value0, Value). 543var_or_function(Value, _, Value). 544 545val_or_function(Arg, _, Arg) :- 546 var(Arg), 547 !, 548 instantiation_error(Arg). 549val_or_function(Func, Value0, Value) :- % TBD 550 xpath_function(Func, Value0, Value), 551 !. 552val_or_function(Value, _, Value).
559text_of_dom(DOM, Text) :- 560 phrase(text_of(DOM), Tokens), 561 atomic_list_concat(Tokens, Text). 562 563text_of(element(_,_,Content)) --> 564 text_of_list(Content). 565text_of([]) --> 566 []. 567text_of([H|T]) --> 568 text_of(H), 569 text_of(T). 570 571 572text_of_list([]) --> 573 []. 574text_of_list([H|T]) --> 575 text_of_1(H), 576 text_of_list(T). 577 578 579text_of_1(element(_,_,Content)) --> 580 !, 581 text_of_list(Content). 582text_of_1(Data) --> 583 { assertion(atom(Data)) }, 584 [Data]
Select nodes in an XML DOM
The library
xpath.pl
provides predicates to select nodes from an XML DOM tree as produced bylibrary(sgml)
based on descriptions inspired by the XPath language.The predicate xpath/3 selects a sub-structure of the DOM non-deterministically based on an XPath-like specification. Not all selectors of XPath are implemented, but the ability to mix xpath/3 calls with arbitrary Prolog code provides a powerful tool for extracting information from XML parse-trees.