Retrieve elements from a page?
So, I wanted to create an application that would download a web page and display its elements in a separate interface...
For instance, go to mailinator.com, automatically log in to a specified account and list all e-mail subjects in a list object.
Is there an easy way to parse an HTML file and pick the elements I want?
Re: Retrieve elements from a page?
You could parse it with an XML parser, they're mostly the same thing with a few differences.
Re: Retrieve elements from a page?
Check this out:
http://www.clickteam.com/epicenter/ubbthreads.php?ubb=showflat&Main=18799&Number=133703
Re: Retrieve elements from a page?
Quote:
Originally Posted by _LB
You could parse it with an XML parser, they're mostly the same thing with a few differences.
No can do...
EasyXML outputs an error and XML Parser Object just plain crashes.
SEELE, will you convert that example to OINC? Since MOO isn't supported anymore and stuff...
Edit: Can do, actually, it pops errors like crazy, but still parses the page... whee!
XML Parser Object just crashes all the time, only EasyXML is usable...
Re: Retrieve elements from a page?
HTML isn't based on XML; it's based on SGML. They're very similar, but you will see errors thrown all over the place when you put HTML through an XML parser. There is a variant of HTML based on XML instead, called XHTML, but it's not very common.
SGML (and so HTML) allows implicitly closed elements, e.g. "<br>" for a line break. XML (and so XHTML) requires "<br/>"; "<br>" without a closing "/" is illegal XML. To make things more fun, "<br/>" is, strictly speaking, not proper SGML-based HTML either, although browsers tolerate it. An XML parser run on HTML would interpret the text after the line break as being inside the "<br>", then throw an error about there being no closing "</br>".
For another example, "<p>text<p>text2<p>text3" in HTML represents three paragraphs (each "<p>" is implicitly closed by the next). In XML it would be interpreted as three "<p>" elements, each nested inside the one before, followed by errors about all three closing "</p>" tags being missing. If it is a validating parser and has an XHTML DTD, it will also throw errors about "<p>" not being allowed inside "<p>".
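To make the difference concrete, here is a minimal sketch in Python (my choice for illustration only; nothing in this thread uses it). It feeds the same "<p>" snippet to a strict XML parser and to a tolerant HTML parser from the standard library. The helper names xml_parse_error, ParagraphCollector and html_paragraphs are made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<p>text<p>text2<p>text3"


def xml_parse_error(markup):
    """Return the XML parser's error message, or None if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p>; a new <p> simply starts a new paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # An unclosed previous <p> is implicitly ended by starting a new entry.
            self.paragraphs.append("")
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data


def html_paragraphs(markup):
    """Return the list of paragraph texts found by the tolerant HTML parser."""
    parser = ParagraphCollector()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.paragraphs


if __name__ == "__main__":
    print("XML parser:", xml_parse_error(SNIPPET))
    print("HTML parser:", html_paragraphs(SNIPPET))
```

The XML parser refuses the snippet because the "<p>" elements are never closed, while the HTML parser just reports three paragraphs.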
Re: Retrieve elements from a page?
I'm parsing an XHTML file...
Do you have any idea on how to approach this problem without having to write a parser from scratch?
By the way, I have to parse the file in order to get text from a page (like the subjects of mails), the number of unread mails, and things like that.
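For what it's worth, this kind of extraction can be sketched without writing a full parser, for example with Python's built-in html.parser (Python is only an illustration here, not something used elsewhere in this thread). Everything about the markup below is an assumption: I have not checked what the real page looks like, so the "td" tag, the "subject" and "unread" class names, and the SAMPLE page are all hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical inbox markup; the real page will almost certainly differ.
SAMPLE = """
<table>
  <tr><td class="subject">Welcome aboard</td><td>10:02</td></tr>
  <tr><td class="subject unread">Your <b>invoice</b></td><td>09:41</td></tr>
</table>
"""


class ClassTextExtractor(HTMLParser):
    """Collects the text inside every <tag> element that carries a given class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.results = []
        self._depth = 0  # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested elements of the same tag so the right end tag closes us.
            if tag == self.tag:
                self._depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self._depth = 1
                self.results.append("")

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.results[-1] += data


def extract_text(markup, tag, css_class):
    """Return the stripped text of each matching element, in document order."""
    parser = ClassTextExtractor(tag, css_class)
    parser.feed(markup)
    parser.close()
    return [text.strip() for text in parser.results]


if __name__ == "__main__":
    print(extract_text(SAMPLE, "td", "subject"))      # mail subjects
    print(len(extract_text(SAMPLE, "td", "unread")))  # number of unread mails
```

The same extractor gives the unread count by counting elements with the (assumed) "unread" class, so one pass over the page per question is enough.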
Re: Retrieve elements from a page?
Doesn't my example show how to avoid just that?
Re: Retrieve elements from a page?
Quote:
Originally Posted by Fimbul
SEELE, will you convert that example to OINC? Since MOO isn't supported anymore and stuff...
Um... no. MOO is better for connecting to the internet and getting data than OINC is in its current state.