1/* Part of SWI-Prolog 2 3 Author: Jan Wielemaker 4 E-mail: J.Wielemaker@vu.nl 5 WWW: http://www.swi-prolog.org 6 Copyright (c) 2009-2016, University of Amsterdam 7 VU University Amsterdam 8 All rights reserved. 9 10 Redistribution and use in source and binary forms, with or without 11 modification, are permitted provided that the following conditions 12 are met: 13 14 1. Redistributions of source code must retain the above copyright 15 notice, this list of conditions and the following disclaimer. 16 17 2. Redistributions in binary form must reproduce the above copyright 18 notice, this list of conditions and the following disclaimer in 19 the documentation and/or other materials provided with the 20 distribution. 21 22 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 23 "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 24 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 25 FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE 26 COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 27 INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 28 BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 29 LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 30 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 31 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN 32 ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 33 POSSIBILITY OF SUCH DAMAGE. 34*/ 35 36:- module(xpath, 37 [ xpath/3, % +DOM, +Spec, -Value 38 xpath_chk/3, % +DOM, +Spec, -Value 39 40 op(400, fx, //), 41 op(400, fx, /), 42 op(200, fy, @) 43 ]). 44:- use_module(library(record)). 45:- use_module(library(lists)). 46:- use_module(library(debug)). 47:- use_module(library(error)). 48:- use_module(library(sgml)). 49 50/** <module> Select nodes in an XML DOM 51 52The library xpath.pl provides predicates to select nodes from an XML DOM 53tree as produced by library(sgml) based on descriptions inspired by the 54XPath language. 55 56The predicate xpath/3 selects a sub-structure of the DOM 57non-deterministically based on an XPath-like specification. Not all 58selectors of XPath are implemented, but the ability to mix xpath/3 calls 59with arbitrary Prolog code provides a powerful tool for extracting 60information from XML parse-trees. 61 62@see http://www.w3.org/TR/xpath 63*/ 64 65:- record 66 element(name, attributes, content). 67 68%! xpath_chk(+DOM, +Spec, ?Content) is semidet. 69% 70% Semi-deterministic version of xpath/3. 71 72xpath_chk(DOM, Spec, Content) :- 73 xpath(DOM, Spec, Content), 74 !. 75 76%! xpath(+DOM, +Spec, ?Content) is nondet. 77% 78% Match an element in a DOM structure. The syntax is inspired by 79% XPath, using () rather than [] to select inside an element. 80% First we can construct paths using / and //: 81% 82% $ =|//|=Term : 83% Select any node in the DOM matching term. 84% $ =|/|=Term : 85% Match the root against Term. 86% $ Term : 87% Select the immediate children of the root matching Term. 88% 89% The Terms above are of type _callable_. The functor specifies 90% the element name. The element name '*' refers to any element. 91% The name =self= refers to the top-element itself and is often 92% used for processing matches of an earlier xpath/3 query. A term 93% NS:Term refers to an XML name in the namespace NS. Optional 94% arguments specify additional constraints and functions. The 95% arguments are processed from left to right. Defined conditional 96% argument values are: 97% 98% $ index(?Index) : 99% True if the element is the Index-th child of its parent, 100% where 1 denotes the first child. Index can be one of: 101% $ `Var` : 102% `Var` is unified with the index of the matched element. 103% $ =last= : 104% True for the last element. 105% $ =last= - `IntExpr` : 106% True for the last-minus-nth element. For example, 107% `last-1` is the element directly preceding the last one. 108% $ `IntExpr` : 109% True for the element whose index equals `IntExpr`. 110% $ Integer : 111% The N-th element with the given name, with 1 denoting the 112% first element. Same as index(Integer). 113% $ =last= : 114% The last element with the given name. Same as 115% index(last). 116% $ =last= - IntExpr : 117% The IntExpr-th element before the last. 118% Same as index(last-IntExpr). 119% 120% Defined function argument values are: 121% 122% $ =self= : 123% Evaluate to the entire element 124% $ =content= : 125% Evaluate to the content of the element (a list) 126% $ =text= : 127% Evaluates to all text from the sub-tree as an atom 128% $ =normalize_space= : 129% As =text=, but uses normalize_space/2 to normalise 130% white-space in the output 131% $ =number= : 132% Extract an integer or float from the value. Ignores 133% leading and trailing white-space 134% $ =|@|=Attribute : 135% Evaluates to the value of the given attribute. Attribute 136% can be a compound term. In this case the functor name 137% denotes the element and arguments perform transformations 138% on the attribute value. Defined transformations are: 139% 140% - number 141% Translate the value into a number using 142% xsd_number_string/2 from library(sgml). 143% - integer 144% As `number`, but subsequently transform the value 145% into an integer using the round/1 function. 146% - float 147% As `number`, but subsequently transform the value 148% into a float using the float/1 function. 149% - string 150% Translate the value into a Prolog string. 151% - lower 152% Translate the value to lower case, preserving 153% the type. 154% - upper 155% Translate the value to upper case, preserving 156% the type. 157% 158% In addition, the argument-list can be _conditions_: 159% 160% $ Left = Right : 161% Succeeds if the left-hand unifies with the right-hand. 162% If the left-hand side is a function, this is evaluated. 163% The right-hand side is _never_ evaluated, and thus the 164% condition `content = content` defines that the content 165% of the element is the atom `content`. 166% The functions `lower_case` and `upper_case` can be applied 167% to Right (see example below). 168% $ contains(Haystack, Needle) : 169% Succeeds if Needle is a sub-string of Haystack. 170% $ XPath : 171% Succeeds if XPath matches in the currently selected 172% sub-DOM. For example, the following expression finds 173% an =h3= element inside a =div= element, where the =div= 174% element itself contains an =h2= child with a =strong= 175% child. 176% 177% == 178% //div(h2/strong)/h3 179% == 180% 181% This is equivalent to the conjunction of XPath goals below. 182% 183% == 184% ..., 185% xpath(DOM, //(div), Div), 186% xpath(Div, h2/strong, _), 187% xpath(Div, h3, Result) 188% == 189% 190% **Examples**: 191% 192% Match each table-row in DOM: 193% 194% == 195% xpath(DOM, //tr, TR) 196% == 197% 198% Match the last cell of each tablerow in DOM. This example 199% illustrates that a result can be the input of subsequent xpath/3 200% queries. Using multiple queries on the intermediate TR term 201% guarantee that all results come from the same table-row: 202% 203% == 204% xpath(DOM, //tr, TR), 205% xpath(TR, /td(last), TD) 206% == 207% 208% Match each =href= attribute in an <a> element 209% 210% == 211% xpath(DOM, //a(@href), HREF) 212% == 213% 214% Suppose we have a table containing rows where each first column 215% is the name of a product with a link to details and the second 216% is the price (a number). The following predicate matches the 217% name, URL and price: 218% 219% == 220% product(DOM, Name, URL, Price) :- 221% xpath(DOM, //tr, TR), 222% xpath(TR, td(1), C1), 223% xpath(C1, /self(normalize_space), Name), 224% xpath(C1, a(@href), URL), 225% xpath(TR, td(2, number), Price). 226% == 227% 228% Suppose we want to select books with genre="thriller" from a 229% tree containing elements =|<book genre=...>|= 230% 231% == 232% thriller(DOM, Book) :- 233% xpath(DOM, //book(@genre=thiller), Book). 234% == 235% 236% Match the elements =|<table align="center">|= _and_ =|<table 237% align="CENTER">|=: 238% 239% ```prolog 240% //table(@align(lower) = center) 241% ``` 242% 243% Get the `width` and `height` of a `div` element as a number, 244% and the `div` node itself: 245% 246% == 247% xpath(DOM, //div(@width(number)=W, @height(number)=H), Div) 248% == 249% 250% Note that `div` is an infix operator, so parentheses must be 251% used in cases like the following: 252% 253% == 254% xpath(DOM, //(div), Div) 255% == 256 257xpath(DOM, Spec, Content) :- 258 in_dom(Spec, DOM, Content). 259 260in_dom(//Spec, DOM, Value) :- 261 !, 262 element_spec(Spec, Name, Modifiers), 263 sub_dom(I, Len, Name, E, DOM), 264 modifiers(Modifiers, I, Len, E, Value). 265in_dom(/Spec, E, Value) :- 266 !, 267 element_spec(Spec, Name, Modifiers), 268 ( Name == self 269 -> true 270 ; element_name(E, Name) 271 ), 272 modifiers(Modifiers, 1, 1, E, Value). 273in_dom(A/B, DOM, Value) :- 274 !, 275 in_dom(A, DOM, Value0), 276 in_dom(B, Value0, Value). 277in_dom(A//B, DOM, Value) :- 278 !, 279 in_dom(A, DOM, Value0), 280 in_dom(//B, Value0, Value). 281in_dom(Spec, element(_, _, Content), Value) :- 282 element_spec(Spec, Name, Modifiers), 283 count_named_elements(Content, Name, CLen), 284 CLen > 0, 285 nth_element(N, Name, E, Content), 286 modifiers(Modifiers, N, CLen, E, Value). 287 288element_spec(Var, _, _) :- 289 var(Var), 290 !, 291 instantiation_error(Var). 292element_spec(NS:Term, NS:Name, Modifiers) :- 293 !, 294 callable_name_arguments(Term, Name0, Modifiers), 295 star(Name0, Name). 296element_spec(Term, Name, Modifiers) :- 297 !, 298 callable_name_arguments(Term, Name0, Modifiers), 299 star(Name0, Name). 300 301callable_name_arguments(Atom, Name, Arguments) :- 302 atom(Atom), 303 !, 304 Name = Atom, Arguments = []. 305callable_name_arguments(Compound, Name, Arguments) :- 306 compound_name_arguments(Compound, Name, Arguments). 307 308 309star(*, _) :- !. 310star(Name, Name). 311 312 313%! sub_dom(-Index, -Count, +Name, -Sub, +DOM) is nondet. 314% 315% Sub is a node in DOM with Name. 316% 317% @param Count is the total number of nodes in the content 318% list Sub appears that have the same name. 319% @param Index is the 1-based index of Sub of nodes with 320% Name. 321 322sub_dom(1, 1, Name, DOM, DOM) :- 323 element_name(DOM, Name). 324sub_dom(N, Len, Name, E, element(_,_,Content)) :- 325 !, 326 sub_dom_2(N, Len, Name, E, Content). 327sub_dom(N, Len, Name, E, Content) :- 328 is_list(Content), 329 sub_dom_2(N, Len, Name, E, Content). 330 331sub_dom_2(N, Len, Name, Element, Content) :- 332 ( count_named_elements(Content, Name, Len), 333 nth_element(N, Name, Element, Content) 334 ; member(element(_,_,C2), Content), 335 sub_dom_2(N, Len, Name, Element, C2) 336 ). 337 338 339%! count_named_elements(+Content, +Name, -Count) is det. 340% 341% Count is the number of nodes with Name in Content. 342 343count_named_elements(Content, Name, Count) :- 344 count_named_elements(Content, Name, 0, Count). 345 346count_named_elements([], _, Count, Count). 347count_named_elements([element(Name,_,_)|T], Name0, C0, C) :- 348 \+ Name \= Name0, 349 !, 350 C1 is C0+1, 351 count_named_elements(T, Name0, C1, C). 352count_named_elements([_|T], Name, C0, C) :- 353 count_named_elements(T, Name, C0, C). 354 355 356%! nth_element(?N, +Name, -Element, +Content:list) is nondet. 357% 358% True if Element is the N-th element with name in Content. 359 360nth_element(N, Name, Element, Content) :- 361 nth_element_(1, N, Name, Element, Content). 362 363nth_element_(I, N, Name, E, [H|T]) :- 364 element_name(H, Name), 365 !, 366 ( N = I, 367 E = H 368 ; I2 is I + 1, 369 ( nonvar(N), I2 > N 370 -> !, fail 371 ; true 372 ), 373 nth_element_(I2, N, Name, E, T) 374 ). 375nth_element_(I, N, Name, E, [_|T]) :- 376 nth_element_(I, N, Name, E, T). 377 378 379%! modifiers(+Modifiers, +I, +Clen, +DOM, -Value) 380% 381% 382 383modifiers([], _, _, Value, Value). 384modifiers([H|T], I, L, Value0, Value) :- 385 modifier(H, I, L, Value0, Value1), 386 modifiers(T, I, L, Value1, Value). 387 388modifier(M, _, _, _, _) :- 389 var(M), 390 !, 391 instantiation_error(M). 392modifier(Index, I, L, Value0, Value) :- 393 implicit_index_modifier(Index), 394 !, 395 Value = Value0, 396 index_modifier(Index, I, L). 397modifier(index(Index), I, L, Value, Value) :- 398 !, 399 index_modifier(Index, I, L). 400modifier(Function, _, _, In, Out) :- 401 xpath_function(Function), 402 !, 403 xpath_function(Function, In, Out). 404modifier(Function, _, _, In, Out) :- 405 xpath_condition(Function, In), 406 Out = In. 407 408implicit_index_modifier(I) :- 409 integer(I), 410 !. 411implicit_index_modifier(last). 412implicit_index_modifier(last-_Expr). 413 414index_modifier(Var, I, _L) :- 415 var(Var), 416 !, 417 Var = I. 418index_modifier(last, I, L) :- 419 !, 420 I =:= L. 421index_modifier(last-Expr, I, L) :- 422 !, 423 I =:= L-Expr. 424index_modifier(N, I, _) :- 425 N =:= I. 426 427xpath_function(self, DOM, Value) :- % self 428 !, 429 Value = DOM. 430xpath_function(content, Element, Value) :- % content 431 !, 432 element_content(Element, Value). 433xpath_function(text, DOM, Text) :- % text 434 !, 435 text_of_dom(DOM, Text). 436xpath_function(normalize_space, DOM, Text) :- % normalize_space 437 !, 438 text_of_dom(DOM, Text0), 439 normalize_space(atom(Text), Text0). 440xpath_function(number, DOM, Number) :- % number 441 !, 442 text_of_dom(DOM, Text0), 443 normalize_space(string(Text), Text0), 444 catch(xsd_number_string(Number, Text), _, fail). 445xpath_function(@Name, element(_, Attrs, _), Value) :- % @Name 446 !, 447 ( atom(Name) 448 -> memberchk(Name=Value, Attrs) 449 ; compound(Name) 450 -> compound_name_arguments(Name, AName, AOps), 451 memberchk(AName=Value0, Attrs), 452 translate_attribute(AOps, Value0, Value) 453 ; member(Name=Value, Attrs) 454 ). 455xpath_function(quote(Value), _, Value). % quote(Value) 456 457xpath_function(self). 458xpath_function(content). 459xpath_function(text). 460xpath_function(normalize_space). 461xpath_function(number). 462xpath_function(@_). 463xpath_function(quote(_)). 464 465translate_attribute([], Value, Value). 466translate_attribute([H|T], Value0, Value) :- 467 translate_attr(H, Value0, Value1), 468 translate_attribute(T, Value1, Value). 469 470translate_attr(number, Value0, Value) :- 471 xsd_number_string(Value, Value0). 472translate_attr(integer, Value0, Value) :- 473 xsd_number_string(Value1, Value0), 474 Value = round(Value1). 475translate_attr(float, Value0, Value) :- 476 xsd_number_string(Value1, Value0), 477 Value = float(Value1). 478translate_attr(string, Value0, Value) :- 479 atom_string(Value0, Value). 480translate_attr(lower, Value0, Value) :- 481 ( atom(Value0) 482 -> downcase_atom(Value0, Value) 483 ; string_lower(Value0, Value) 484 ). 485translate_attr(upper, Value0, Value) :- 486 ( atom(Value0) 487 -> upcase_atom(Value0, Value) 488 ; string_upper(Value0, Value) 489 ). 490 491xpath_condition(Left = Right, Value) :- % = 492 !, 493 var_or_function(Left, Value, LeftValue), 494 process_equality(LeftValue, Right). 495xpath_condition(contains(Haystack, Needle), Value) :- % contains(Haystack, Needle) 496 !, 497 val_or_function(Haystack, Value, HaystackValue), 498 val_or_function(Needle, Value, NeedleValue), 499 atom(HaystackValue), atom(NeedleValue), 500 ( sub_atom(HaystackValue, _, _, _, NeedleValue) 501 -> true 502 ). 503xpath_condition(Spec, Dom) :- 504 in_dom(Spec, Dom, _). 505 506 507%! process_equality(+Left, +Right) is semidet. 508% 509% Provides (very) partial support for XSLT functions that can be 510% applied according to the XPath 2 specification. 511% 512% For example the XPath expression in [1], and the equivalent 513% Prolog expression in [2], would both match the HTML element in 514% [3]. 515% 516% == 517% [1] //table[align=lower-case(center)] 518% [2] //table(@align=lower_case(center)) 519% [3] <table align="CENTER"> 520% == 521 522process_equality(Left, Right) :- 523 var(Right), 524 !, 525 Left = Right. 526process_equality(Left, lower_case(Right)) :- 527 !, 528 downcase_atom(Left, Right). 529process_equality(Left, upper_case(Right)) :- 530 !, 531 upcase_atom(Left, Right). 532process_equality(Left, Right) :- 533 Left = Right. 534 535 536var_or_function(Arg, _, Arg) :- 537 var(Arg), 538 !. 539var_or_function(Func, Value0, Value) :- 540 xpath_function(Func), 541 !, 542 xpath_function(Func, Value0, Value). 543var_or_function(Value, _, Value). 544 545val_or_function(Arg, _, Arg) :- 546 var(Arg), 547 !, 548 instantiation_error(Arg). 549val_or_function(Func, Value0, Value) :- % TBD 550 xpath_function(Func, Value0, Value), 551 !. 552val_or_function(Value, _, Value). 553 554 555%! text_of_dom(+DOM, -Text:atom) is det. 556% 557% Text is the joined textual content of DOM. 558 559text_of_dom(DOM, Text) :- 560 phrase(text_of(DOM), Tokens), 561 atomic_list_concat(Tokens, Text). 562 563text_of(element(_,_,Content)) --> 564 text_of_list(Content). 565text_of([]) --> 566 []. 567text_of([H|T]) --> 568 text_of(H), 569 text_of(T). 570 571 572text_of_list([]) --> 573 []. 574text_of_list([H|T]) --> 575 text_of_1(H), 576 text_of_list(T). 577 578 579text_of_1(element(_,_,Content)) --> 580 !, 581 text_of_list(Content). 582text_of_1(Data) --> 583 { assertion(atom(Data)) }, 584 [Data]