Dynamic extractor patterns, or session/other variables' values inserted into extractor patterns...

The goal is to deploy very specific extractor patterns that select, for example, the text on either side of a known string such as "Quercus rubra". This is the Latin name for Red Oak, but many other common names are in use. The idea would be to have a pattern containing a token such as ~#LATINNAME#~ that reads in a session variable holding "Quercus rubra", set from a .txt file.

The .txt file would include hundreds if not thousands of other Latin names, e.g.:
Acer negundo
Acer rubrum
Acer saccharinum
Acer saccharum
Amelanchier canadensis
Betula lutea
Betula nigra
Betula papyrifera
Carpinus caroliniana
Carya cordiformis
Carya ovata
Celtis occidentalis
Crataegus sp.
Fagus grandifolia
Fraxinus americana
Fraxinus nigra
Fraxinus pennsylvanica var. lanceolata
Fraxinus quadrangulata
Gleditsia triacanthos
Gymnocladus dioica
Juglans cinerea
Juglans nigra
Morus rubra
Ostrya virginiana
Populus balsamifera
Populus deltoides
Populus grandidentata
Populus tremuloides
Prunus americana
Prunus pennsylvanica
Prunus serotina
Prunus virginiana
Quercus alba
Quercus bicolor
Quercus ellipsoidalis
Quercus macrocarpa
Quercus muehlenbergii
Quercus rubra
Quercus velutina
Robinia pseudoacacia
Salix sp.
Sorbus americana
Tilia americana
Toxicodendron vernix
Ulmus americana
Ulmus rubra
Ulmus thomasii

Creating a specific extractor pattern for each such entry, e.g. [[[ ~@COMMONNAME@~ Sorbus americana ]]], would be prohibitively time-consuming...

Two ways I can see attempting this are:

(1) Use a ~#LATINNAME#~ variable as input in the definition of the extractor pattern, e.g. [[[ ~@COMMONNAME@~ ~#LATINNAME#~ ]]], or

(2) Allow dynamic creation and input of extractor patterns on the fly (?)

I only saw one previous forum question on this but did not find any replies... the only seemingly relevant thing I found was a reference to "deprecated: embedded session variables in extractor patterns" in an upgrade report.

Thanks!

Hi,

You could have a script that initialises the variable before every scrape and then loops, calling the scrape again after each scrape. So the script would be

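// Runs before each scrape: pick the next name, then call the scrapeable file again.
if (session.getVariable("SCRAPE_NUM") == null)
{
    // setVariable takes the name and the value as two arguments.
    session.setVariable("SCRAPE_NUM", "1");
    session.setVariable("Name_of_Tree", "Red Oak");
    session.scrapeFile("ScrapeFileName");
}
// Compare strings with equals(); == only compares object references.
else if ("1".equals(session.getVariable("SCRAPE_NUM")))
{
    session.setVariable("SCRAPE_NUM", "2");
    session.setVariable("Name_of_Tree", "Red Oak 2");
    session.scrapeFile("ScrapeFileName");
}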

And so on. Then use the Advanced tab in the extractor pattern and tick the option to save the data set automatically. If this does not work, you could create a data set manually after each extractor pattern does its stuff.

That's a good suggestion, Seamus. I would just add a few ideas for efficiency.

If you're iterating through possibly thousands of names from an input file, and the target site has many names on any given page, then you may want to first save the page using session.getContentAsString. Then attempt to match your names against each page as a local file (just replace the URL of your scrapeableFile with a local file path). This will save you and the target server thousands of unnecessary requests.

Also, be cautious when selecting "Automatically save the data set generated by this extractor pattern in a session variable". Doing so can quickly fill up the memory allocated to screen-scraper. As an alternative, you may choose to save any matching results to another text file, or set any individual session variables to null after each loop. This would spare your RAM.
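
To illustrate the "fetch once, match many names, write results out" idea, here is a minimal sketch in plain Java, outside of screen-scraper. The class name CachedPageScan, its method names, and the sample page text are all made up for illustration; it assumes the page has already been saved as a string and that a plain substring test is enough to decide whether a name is worth a closer look.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: scan one saved page for many Latin names.
final class CachedPageScan {

    // Returns the names from latinNames that occur in the saved page text.
    static List<String> namesOnPage(String pageText, List<String> latinNames) {
        List<String> hits = new ArrayList<>();
        for (String name : latinNames) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty() && pageText.contains(trimmed)) {
                hits.add(trimmed);
            }
        }
        return hits;
    }

    // Appends one line per hit to a text file, so results do not pile up
    // in session variables between pages.
    static void appendHits(Path out, List<String> hits) throws IOException {
        Files.write(out, hits, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a page that was downloaded once and kept as a string.
        String page = "Red Oak (Quercus rubra) $12.00 Sugar Maple (Acer saccharum) $15.00";
        List<String> names = List.of("Acer rubrum", "Acer saccharum", "Quercus rubra");

        List<String> hits = namesOnPage(page, names);
        Path out = Files.createTempFile("latin-name-hits", ".txt");
        appendHits(out, hits);
        System.out.println(hits + " written to " + out);
    }
}
```

Only the names that actually appear on a page would then need the more expensive common-name extraction.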

-Scott

Additional clarification for my question...

Hello, thank you both for your timely responses!

I learned much from your suggestions, yet am thinking I was not clear enough in my question. I'll try adding a bit more here...

The common name of the plant, i.e. Red Oak (which is the data to be extracted), is inherently spatially related to its Latin name of Quercus rubra (the data string known going in).

The spatial relationship is almost always one of immediate adjacency, e.g. "Red Oak (Quercus rubra)", or it may be nearby but separated by price, size, other specifications in what's known as a plant list.

(The differing presentations of adjacency I hope to catch using Regular Expressions.)

Seemingly the challenge is to place a value such as "Quercus rubra" into an "Extractor Pattern Input Variable" such as ~#LATINNAME#~, just as SS provides the capability for doing so in a URL with ~#SEARCHTERM#~, ~#PAGE#~, etc.

A simplistic result (without benefit of regular expression, catching only one type of representation) would be an extractor pattern something like "~@COMMONNAME@~(~#LATINNAME#~", where the extractor pattern's definition catches the adjacency of the two pieces of data...
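
To make the adjacency idea concrete, here is a minimal sketch in plain Java regex rather than screen-scraper's token syntax. It builds one pattern per Latin name at run time and captures whatever sits directly before "(Latin name)". The class and method names are invented for illustration, and the rule that a common name is a run of capitalized words is an assumption that real plant lists may well break.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: find common names written as "Common Name (Latin name)".
final class AdjacentNameMatcher {

    // Returns every run of capitalized words found directly before "(latinName)".
    static List<String> commonNamesFor(String latinName, String text) {
        // Pattern.quote keeps characters such as the "." in "Salix sp." literal.
        Pattern p = Pattern.compile(
                "([A-Z][A-Za-z'-]*(?: [A-Z][A-Za-z'-]*)*)\\s*\\(\\s*"
                        + Pattern.quote(latinName) + "\\s*\\)");
        Matcher m = p.matcher(text);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String page = "Red Oak (Quercus rubra) $12.00, Sugar Maple (Acer saccharum) $15.00";
        System.out.println(commonNamesFor("Quercus rubra", page));   // prints [Red Oak]
        System.out.println(commonNamesFor("Acer saccharum", page));  // prints [Sugar Maple]
    }
}
```

Calling commonNamesFor once per line of the .txt file would give the same effect as a pattern with the Latin name substituted in, without defining thousands of patterns by hand.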

My write-out-to-file would build rows perhaps like:

Red Oak, Quercus rubra
Common Oak, Quercus rubra
Southern Oak, Quercus rubra
Hard Maple, Acer saccharum
Sugar Maple, Acer saccharum

that is, a collection of the various common names found to be used for each Latin name.

With Seamus' suggestion, I wasn't able to figure out how the ~#LATINNAME#~ variable, once set using the scripts he suggested, could either (a) be deployed within an extractor pattern defined just once, or (b) be used to dynamically create multitudinous (fancy word of the day) extractor patterns (at least within SS), each with a different value hardcoded into it.

If I can get this all going, however, all the remaining content in your suggestions will be quite valuable!

Thx

Because you're not able to load a regular expression into an extractor pattern token on the fly, I believe you're going to need to accomplish this either in a script or outside of screen-scraper.

I can picture a script that takes a good portion of the HTML from the last response and performs a string match (substring) looking for the COMMON_NAME. Once found, it grabs a variable number of characters to the left and right of the COMMON_NAME (located with indexOf), then performs a string match on that string looking for the LATIN_NAME. If the LATIN_NAME is found in proximity to the COMMON_NAME, they are considered a match.

You'll end up making heavy use of the methods available in the Java String class, particularly indexOf, substring, and length. You can even use regex, since the matches method takes a regex as a parameter.
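
As a rough sketch of that proximity check, here is what it might look like in plain Java using only indexOf, substring, and length. The class name, method names, and sample HTML are made up for illustration, and the sketch only looks at the first occurrence of the anchor string; a real script would probably loop over every occurrence.

```java
// Hypothetical helper: decide whether two strings appear close together.
final class ProximityMatcher {

    // Returns the text within `radius` characters on either side of the first
    // occurrence of `anchor`, or null if `anchor` does not occur in `html`.
    static String windowAround(String html, String anchor, int radius) {
        int at = html.indexOf(anchor);
        if (at < 0) {
            return null;
        }
        int start = Math.max(0, at - radius);
        int end = Math.min(html.length(), at + anchor.length() + radius);
        return html.substring(start, end);
    }

    // True if latinName appears entirely inside the window around commonName.
    static boolean isNear(String html, String commonName, String latinName, int radius) {
        String window = windowAround(html, commonName, radius);
        return window != null && window.contains(latinName);
    }

    public static void main(String[] args) {
        String html = "<td>Red Oak</td><td>$12.00</td><td>Quercus rubra</td>";
        System.out.println(isNear(html, "Red Oak", "Quercus rubra", 40)); // prints true
        System.out.println(isNear(html, "Red Oak", "Quercus rubra", 5));  // prints false
    }
}
```

The same windowAround call also works the other way round: anchor on the known LATIN_NAME and inspect the returned window for candidate common names.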

Here are a few example scripts that use some of the same methods you'll be using. The last in the list, Square Footage Catcher, is attempting to do something similar. It is looking for the number of square feet on a given web page.

They are faced with a similar issue to yours. For them, how the square footage is written can vary from site to site. For you, the formatting of text and markup between the COMMON_NAME and the LATIN_NAME can vary. Hopefully the solution to your challenge won't be quite as involved as theirs was.

-Scott