How can I use the Regular Expression to pull data?
Hey, folks!
I was hoping there was someone in the community who knows how to parse HTML with the Regular Expression object better than I do (my own knowledge of it is next to none!).
I am trying to use the Regular Expression object to parse the title of HTML files (literally, from the <Title> tags). However, my attempts to do so have been unsuccessful; I don't know if I'm performing an incorrect action or using an incorrect expression to display the results.
Is there anyone out there who could give me direction on how to parse the title of HTML files from the <Title> tags?
Thank you very much for your help! I really appreciate it!
Most graciously...
RGBreality
Re: How can I use the Regular Expression to pull data?
Regular expressions are really bad at parsing HTML, but in this case you don't even need RegEx to find the title.
For example, with String Parser 2 you can find the position (an index number) of the first occurrence of "<title>" and then the position of "</title>". The title is the string between those two positions, which you can extract with
Mid$(<html here>, firstIndex + 7, lastIndex - firstIndex - 7)
That should give you the title. The 7 is the length of '<title>': it is added to the start position to skip past the tag, and subtracted from the length so the result stops just before '</title>'.
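For anyone who wants to check the index arithmetic outside MMF2, here is a minimal Python sketch of the same find-and-slice idea. The function name `extract_title` and the sample HTML are made up for illustration, and `str.find` only stands in for String Parser 2's index lookup (assumed here to return 0-based positions). The match is case-sensitive, so a file that writes the tag as `<Title>` would need extra handling.

```python
from typing import Optional

OPEN_TAG = "<title>"
CLOSE_TAG = "</title>"


def extract_title(html: str) -> Optional[str]:
    """Return the text between <title> and </title>, or None if a tag is missing."""
    first_index = html.find(OPEN_TAG)
    if first_index == -1:
        return None
    # Skip past the opening tag: this is the "+ 7" in the Mid$ call.
    start = first_index + len(OPEN_TAG)
    # Look for the closing tag only after the opening tag.
    last_index = html.find(CLOSE_TAG, start)
    if last_index == -1:
        return None
    # The slice length is last_index - start, i.e. last_index - first_index - 7.
    return html[start:last_index]


if __name__ == "__main__":
    sample = "<html><head><title>My Page</title></head><body></body></html>"
    print(extract_title(sample))
```

The slice `html[start:last_index]` covers the same characters a Mid$-style call would return when given `start` as the position and `last_index - start` as the length.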
Re: How can I use the Regular Expression to pull data?
Hey, Andos! Thank you for your reply!
So, let me see if I'm following you, step by step... All of the following uses the String Parser 2 object:
Action 1: Set default delimiter to "<title>".
Action 2: Add delimiter "</title>".
Action 3: Perform a Mid$(<PATH TO HTML FILE>, .....)
It's here that I get confused... How do I specify the index of the first (or default) delimiter, and then the second? I did see the "Get Delimiter Index" option, but that seemed to refer to the index assigned to the delimiter itself (not its position in the source document).
Could you offer me a little more guidance?
Thank you very much, Andos! I really appreciate it!
Most graciously...
RGBreality
Re: How can I use the Regular Expression to pull data?
<html here> = the string containing the HTML
I modified it a little to only use String Parser 2:
http://andersriggelsen.dk/uploads/extractTitle.mfa
Re: How can I use the Regular Expression to pull data?
Hey, Andos!
I still seem to be having some problems, though at least I'm getting varied results as I try different things...
I'm not sure if I'm translating your example correctly (without using the counters). Here is the actual code I'm using to pull the specific title information from a Regular Expression object that contains the text of the HTML file:
Mid$(GetString$( "Regular Expression object" ), indexOfSub( "String Parser", "<title>", 1)+6, indexOfSub( "String Parser", "</title>", 1)-7)
Yet, if I simply put in the actual text from the Regular Expression object, the HTML code appears. Furthermore, if I use a starting character of 459 as the first character to extract the middle string, then I get proper results 99% of the time (once in a while it seems an HTML file's "<title>" tag begins later).
So, I'm not sure what I might be doing wrong. Any ideas?
Thank you for giving me a hand!
Most appreciatively...
RGBreality
Re: How can I use the Regular Expression to pull data?
Pretty sure it should be:
Quote:
Originally Posted by RGBreality
Mid$(GetString$( "Regular Expression object" ), indexOfSub( "String Parser", "<title>", 1)+7, indexOfSub( "String Parser", "</title>", 1)-indexOfSub( "String Parser", "<title>", 1)-7)
Re: How can I use the Regular Expression to pull data?
RGBreality, you don't need the Regular Expression object at all. Look at Andos' example. Use it with the raw, unparsed HTML.
Re: How can I use the Regular Expression to pull data?
Hey, folks! Thank you for all your help!
I was able to figure out a work-around (using the Regular Expression object as search criteria for the "</title>" string).
I'm afraid I can't pull directly from the raw HTML file, as I am only working from the directory path to the raw HTML file. So, I have been using the Regular Expression object to import the raw HTML file into its internal string; this was proving necessary as I'm using a derivative of Nifflas' (wrong spelling, sorry!) file-search example to find specific text from the file-search results. In doing so, I'm only referencing directory paths to the HTML files. (I have found that using Nifflas' groundwork, searching through thousands of pages of my company's HTML Help documents is TONS faster than using the Search object's indexing mode.)
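The path-based flow described here (start from a list of file paths, load each file's text, then pull the title) can be sketched in Python as follows. This is only an illustration of the idea, not what the Regular Expression object or the file-search example does internally: `read_title`, `titles_for_paths`, and the UTF-8 encoding with replacement of bad bytes are all assumptions.

```python
from pathlib import Path
from typing import Dict, Iterable, Optional


def read_title(path: Path) -> Optional[str]:
    """Load one HTML file and return the text between <title> and </title>."""
    # Assumption: files are read as UTF-8; undecodable bytes are replaced, not fatal.
    html = path.read_text(encoding="utf-8", errors="replace")
    first = html.find("<title>")
    if first == -1:
        return None
    start = first + len("<title>")
    last = html.find("</title>", start)
    if last == -1:
        return None
    return html[start:last]


def titles_for_paths(paths: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each file path to its title (None when no title is found)."""
    return {p: read_title(Path(p)) for p in paths}
```

Because each file's tag positions are looked up individually, nothing in this sketch depends on the title sitting at the same offset in every document.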
(It very well could be that I'm doing this totally ass-backwards, as I'm definitely still an MMF2 newbie. But at least I'm getting the results I wanted!)
So, the syntax I used for the Extract Middle String expression ended up like this:
Mid$(GetString$( "Regular Expression object" ), 456, Submatch Start( "Regular Expression", 1)-458)
(where "Submatch Start" is the position of the "</" portion of the "</title>" tag).
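A fixed start position such as 456 only holds while every file puts its "<title>" tag at the same offset, which matches the earlier observation that a hard-coded start was right about 99% of the time. As a rough sketch of letting the pattern locate both tags instead, here is the same idea with Python's `re` module (not the MMF2 Regular Expression object, whose syntax and options may differ). The pattern and the name `find_title` are assumptions for illustration.

```python
import re
from typing import Optional

# Group 1 captures whatever sits between the opening and closing title tags.
# IGNORECASE accepts <Title> or <TITLE>; DOTALL lets the title span line breaks.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def find_title(html: str) -> Optional[str]:
    """Return the captured title with surrounding whitespace trimmed, or None."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    return match.group(1).strip()
```

Here `match.start(1)` and `match.end(1)` give the positions of the captured text, so no offset has to be typed in by hand.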
The search functionality of this application I'm developing has turned out really well (though, of course, I stand on the shoulders of MMF2 giants). If anyone is interested, I'd be glad to share the source file once I have the search parameters finished (though I am horrible at documenting comments).
Thanks again for everyone's assistance! I really do appreciate it!
Most graciously...
RGBreality