Making Add-on Modules (2)

(step 2 of the add-on module making process)

$out{error}

HTTP RESPONSE HEADER:

$out{response_header}

HTML CODE of the Search Results page:

$out{search_results_html}

VIEW of the Search Results page:

$out{search_results_view}

An add-on module must define regular expressions that extract text data from the search engine's output. If you are familiar with Perl regular expressions, just enter them below. Otherwise, use the alternative form: define a parsing range by choosing 'start' and 'end' marks. The marks are the numbers shown in red within brackets in the HTML code above. If your first choice does not extract the text you want, try different combinations of marks.

NOTE: All of the following sections are OPTIONAL. Define regular expressions only for the data you want extracted.

SEARCH RESULTS PARSING REGULAR EXPRESSIONS:

tip: Use a wider range if parsing is unsuccessful, and a narrower range if parsing produces more results than expected. Start testing with medium-size ranges.

'Url and Title' This expression parses the result's access url and title.
Define a parsing range from mark# to mark#
or a regular expression:

example: <a href=(http://[^>]+)>([^<]+)</a>
The data to be extracted must be enclosed within parentheses ( ) in the regular expression.
The extracted data will be stored in the following Data names: follow_url title
You can access them in the result template as $out{follow_url}, $out{title}, $out{e_follow_url} etc...
Indices of the ( ) pairs within the regular expression that correspond to the parsed data. Your regular expression may contain additional ( ) pairs that do not enclose data you need. To skip them, list which ( ) pairs to use, numbered in the order they appear from left to right. Usually these are the first and second pairs.
hint: 1 2
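
As an illustration of how the ( ) indices map onto the data names, here is a minimal sketch. It is written in Python, whose `re` syntax is compatible with Perl for this pattern; the helper `extract` and the sample HTML are hypothetical and are not part of the software.

```python
import re

# Example pattern from the text: pair 1 is the url, pair 2 is the title.
URL_TITLE = re.compile(r'<a href=(http://[^>]+)>([^<]+)</a>')

def extract(pattern, html, names, order):
    """Hypothetical helper: map the chosen ( ) pairs (1-based) onto data names."""
    results = []
    for m in pattern.finditer(html):
        results.append({name: m.group(i) for name, i in zip(names, order)})
    return results

# Hypothetical sample of a search engine's output.
SAMPLE_HTML = (
    '<li><a href=http://example.com/a>First site</a> Some text</li>'
    '<li><a href=http://example.org/b>Second site</a> More text</li>'
)

# Data names "follow_url title" with indices "1 2", as in the hint.
rows = extract(URL_TITLE, SAMPLE_HTML, ['follow_url', 'title'], [1, 2])
for row in rows:
    print(row['follow_url'], '|', row['title'])
```

If the expression had an extra ( ) pair in front of the url, the indices would become 2 3 while the data names stay the same.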

'Real Url' In most cases you can skip this definition. This expression parses the result's real url explicitly. It is useful when the result's 'follow_url' differs from the real site url and the software fails to extract the real url from the 'follow_url'. Some search sites link to the result sites only through redirecting urls. If the actual site url is contained in the redirecting url, the software can usually guess the 'real_url' automatically, without parsing it separately. If the actual site url is displayed on the page, you can parse it explicitly here. The program uses 'real_url' to eliminate duplicate urls from the result set.
Define a parsing range from mark# to mark#
or a regular expression:

example:  URL: (http://[^<]+)<
The extracted data will be stored in the following Data name: real_url
Index of the ( ) pair within the regular expression that corresponds to the parsed data.
hint: 1
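
The following sketch shows one way the 'real_url' field could be used to drop duplicate results. It uses Python's `re` as a stand-in for Perl; the function `dedupe_by_real_url` and the sample result blocks are assumptions made for illustration, and the program's actual duplicate handling may differ.

```python
import re

# Example pattern from the text: pair 1 is the displayed site url.
REAL_URL = re.compile(r'URL: (http://[^<]+)<')

def dedupe_by_real_url(blocks):
    """Hypothetical helper: keep the first result for each distinct real_url."""
    seen = set()
    kept = []
    for block in blocks:
        m = REAL_URL.search(block)
        real_url = m.group(1).strip() if m else None
        if real_url is not None:
            if real_url in seen:
                continue  # same site already listed; skip the duplicate
            seen.add(real_url)
        kept.append(real_url)
    return kept

# Hypothetical result blocks: the links are redirects, the real url is displayed.
BLOCKS = [
    '<a href=http://search.example/r?id=1>Site</a><br>URL: http://example.com/<br>',
    '<a href=http://search.example/r?id=2>Site again</a><br>URL: http://example.com/<br>',
    '<a href=http://search.example/r?id=3>Other</a><br>URL: http://example.org/page<br>',
]

unique_urls = dedupe_by_real_url(BLOCKS)
print(unique_urls)
```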

'Description' This expression parses the result's description.
Define a parsing range from mark# to mark#
or a regular expression:

tip: start the parsing range right at the end of the linked title. Let the beginning mark be the one just before the closing tag </a>
The extracted data will be stored in the following Data name: description
Index of the ( ) pair within the regular expression that corresponds to the parsed data.
hint: 1

REGULAR EXPRESSIONS USED TO PARSE ADDITIONAL DATA FOR EACH SEARCH RESULT:

NOTE: Define extra expressions here only if you want more data fields extracted for each search result. Alternatively, you can skip all of the above definitions and, for best performance, write a single regular expression that parses all the fields you want. Here is an example of a universal parsing expression that may work for most search engines:
example:
Parsing Name: all_results_parsing
Regular Expression: <(dt|li|p)>\s*<a\s*href=['"]([^'"]+)['"][^>]*>(.+?)</a>\s*(.+?)(<(p|/dl|/dd|/ol|/ul|a|/table|/li|(?=<li>))|http)
Data Names: follow_url title description
Order: 2 3 4
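
To show how the data names and the order line up with the ( ) pairs of the universal expression, here is a minimal sketch in Python (the pattern is unchanged and is valid in both Perl and Python). The sample HTML is hypothetical; real search pages will need testing of their own.

```python
import re

# The universal expression from the example above, split over two lines.
UNIVERSAL = re.compile(
    r'''<(dt|li|p)>\s*<a\s*href=['"]([^'"]+)['"][^>]*>(.+?)</a>\s*(.+?)'''
    r'''(<(p|/dl|/dd|/ol|/ul|a|/table|/li|(?=<li>))|http)'''
)

DATA_NAMES = ['follow_url', 'title', 'description']
ORDER = [2, 3, 4]  # pair 1 is only the list tag, so it is skipped

# Hypothetical sample of a search results page.
SAMPLE_HTML = (
    '<ul>\n'
    '<li><a href="http://example.com/">Example Site</a> A sample description.</li>\n'
    '<li><a href="http://example.org/doc" class="r">Docs</a> Reference pages.</li>\n'
    '</ul>'
)

results = [
    dict(zip(DATA_NAMES, (m.group(i) for i in ORDER)))
    for m in UNIVERSAL.finditer(SAMPLE_HTML)
]
for r in results:
    print(r)
```

Note that the description stops at the first following tag from the list or at the text "http", so a description that itself contains "http" would be cut short.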

Parsing Name	Regular Expression	From mark#	To mark#	Data Names (separated by spaces)	Order

REGULAR EXPRESSIONS USED TO ACCESS 'NEXT PAGE' SEARCH RESULTS:

NOTE: Define these expressions only if you want to access and parse more search result pages from this search engine.

'Next Page URL' Define this expression only if you want to access 'next page' results and did not explicitly define the 'next page' url in step 1. Usually this is the url linked from the word 'next'.
Define a parsing range from mark# to mark#
or a regular expression:

example: <a\s[^>]*href\s*=\s*"([^\s">]+)"[^>]*>>></a>
The extracted data will be stored in the following data name: next_page_url
Index of the ( ) pair within the regular expression that corresponds to the extracted data.
hint: 1
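
A minimal sketch of the 'Next Page URL' example applied to a page, using Python's `re` as a stand-in for Perl. The sample HTML is hypothetical and assumes the 'next' link text is the literal characters >>.

```python
import re

# Example pattern from the text: pair 1 is the url of the '>>' link.
NEXT_PAGE_URL = re.compile(r'<a\s[^>]*href\s*=\s*"([^\s">]+)"[^>]*>>></a>')

# Hypothetical fragment of a results page.
SAMPLE_HTML = '<p>Results 1-10 <a class="nav" href="/search?q=perl&start=10">>></a></p>'

m = NEXT_PAGE_URL.search(SAMPLE_HTML)
next_page_url = m.group(1) if m else None  # None means no next page link was found
print(next_page_url)
```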

'Is There Next Page' Define this expression only if you did not define the 'Next Page URL' expression and have explicitly defined the 'next page' url in step 1. This expression tests the output to see whether there is a 'next page' to follow. For example, if the current page contains the linked word 'Next >>', you know that there is a next page. If the 'Next Page URL' regular expression is not defined, the module produces its next request url only if the 'Is There Next Page' expression matches. This expression does not extract data; it is used as a boolean flag.
Define a parsing range from mark# to mark#
or a regular expression:

example: next</a>
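
The decision described above can be sketched as follows. This is only an illustration in Python: the function `next_request_url` and its parameters are hypothetical names, and the module's real logic may differ in detail.

```python
import re

def next_request_url(html, next_url_re=None, has_next_re=None, step1_next_url=None):
    """Hypothetical sketch: pick the url of the next results page, or None."""
    if next_url_re is not None:
        # 'Next Page URL' is defined: the url is extracted from the page itself.
        m = re.search(next_url_re, html)
        return m.group(1) if m else None
    if has_next_re is not None and step1_next_url is not None:
        # Only the boolean flag is defined: use the url defined in step 1,
        # but only when the flag expression matches.
        if re.search(has_next_re, html):
            return step1_next_url
    return None

# Hypothetical pages and step 1 url.
PAGE_WITH_NEXT = '<a href="/s?p=2">next</a>'
PAGE_LAST = '<b>end of results</b>'
STEP1_URL = 'http://search.example/s?p=2'

print(next_request_url(PAGE_WITH_NEXT, has_next_re=r'next</a>', step1_next_url=STEP1_URL))
print(next_request_url(PAGE_LAST, has_next_re=r'next</a>', step1_next_url=STEP1_URL))
```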

REGULAR EXPRESSIONS EXTRACTING GENERAL INFORMATION ABOUT THE SEARCH QUERY:

'Reported Count' Define this expression only if you want the number of results reported by the search engine to be available. This expression parses the number of found results that the search engine reports.
Define a parsing range from mark# to mark#
or a regular expression:

example: Displaying results (\d[\d,\s]*)-(\d[\d,\s]*) of (\d[\d,\s]*)
The extracted data will be stored in the following Data name: reported_count
Index of the ( ) pair within the regular expression that corresponds to the parsed data.
hint: 3
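
The sketch below applies the example expression and takes the third ( ) pair, as in the hint. It is written in Python as a stand-in for Perl; stripping commas and spaces before converting to a number is an assumption for illustration, not a documented behavior of the software.

```python
import re

# Example pattern from the text: pair 3 is the total number of results.
REPORTED_COUNT = re.compile(
    r'Displaying results (\d[\d,\s]*)-(\d[\d,\s]*) of (\d[\d,\s]*)'
)

def reported_count(html, index=3):
    """Hypothetical helper: return the reported total as an int, or None."""
    m = REPORTED_COUNT.search(html)
    if m is None:
        return None
    # Assumption: drop thousands separators and whitespace before converting.
    return int(re.sub(r'[,\s]', '', m.group(index)))

print(reported_count('Displaying results 1-10 of 12,345 for perl'))
```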

'Reported No Results' Define this expression if you defined the 'Reported Count' regular expression. This expression detects whether the search engine found 0 results. You can simply enter the sentence, or only part of it, that the page uses to say that no results were found. This expression does not extract data; it is used as a boolean flag that distinguishes between 'unknown' and '0' reported results.
Define a parsing range from mark# to mark#
or a regular expression:

example1: We found 0 results
example2: did not match any documents
universal example: (Sorry|(0|no)\s+(document|result|sites)|did not match|No Web pages found)
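
One way to picture the difference between 'unknown' and '0' is the sketch below. It is a Python illustration under stated assumptions: the function `classify_count` is a hypothetical name, `None` stands for 'unknown', and the order of the two checks is a guess rather than the software's documented behavior.

```python
import re

# Both patterns are the examples given in the text.
REPORTED_COUNT = re.compile(
    r'Displaying results (\d[\d,\s]*)-(\d[\d,\s]*) of (\d[\d,\s]*)'
)
NO_RESULTS = re.compile(
    r'(Sorry|(0|no)\s+(document|result|sites)|did not match|No Web pages found)'
)

def classify_count(html):
    """Hypothetical helper: an int count, 0 for 'no results', None for 'unknown'."""
    m = REPORTED_COUNT.search(html)
    if m:
        return int(re.sub(r'[,\s]', '', m.group(3)))
    if NO_RESULTS.search(html):
        return 0      # the page explicitly says nothing was found
    return None       # neither expression matched: the count is unknown

print(classify_count('Displaying results 1-10 of 250'))
print(classify_count('Your search did not match any documents.'))
print(classify_count('<html>Server busy</html>'))
```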

THE EXTRA PERL CODE:

Edit the Perl subroutines below as you prefer. This code resides in the module's Perl source code just after the definition of the module's profile, and the software looks for it there during the 'remake' phase.
$out{extra_code}

$out{hidden_fields}

Note: This utility works only if the web server's machine is connected to the Internet.
