SS feedback / suggestions


First I want to take a moment to thank you guys for a great product and excellent support. Keep up the great work!

I've used the product now for a bit and wanted to provide some feedback and suggestions:

  1. Contextual highlighting in extractor patterns - I'd like to see SS tokens highlighted so it would be easier to find them within the text.
    example: ~@NAME@~ - the ~@ and @~ might be in blue while NAME would be in red.
    Also, it might even be nice to give a visual cue that a token is a session variable.
  2. Editing a token in an extractor pattern shouldn't drop me to the last line of the extractor pattern - When editing an extractor pattern with multiple tokens, each save of a token takes me back to the bottom of the field, so I have to save, scroll up, edit, save, scroll up again to the next token, and so on. If the cursor position could be remembered when you save the token, that would rock and make editing faster.
  3. Include a sidebar or dropdown list of existing tokens for a scrapeable file or extractor pattern - This would be a nice-to-have. Clicking on a token in the dropdown list or pattern sidebar would position your cursor on that token. Including a visual representation of a session variable would be awesome.
  4. Extractor pattern sections display - Display extractor pattern sections horizontally rather than vertically. So instead of this:
    main -> sub-Extractor Patterns -> Advanced
    Identifier: Name    Sequence: 1
    main -> sub-Extractor Patterns -> Advanced
    Identifier: ID    Sequence: 2
    main -> sub-Extractor Patterns -> Advanced
    Identifier: Address    Sequence: 3

    Have this:

    1. Name        2. ID          3. Address
    m->sub-E->a    m->sub-E->a    m->sub-E->a

    It is kind of hard to display with text rather than an image, but hopefully you get the idea: the tabs run across, with the sequence number and name at the top and the main, sub-extractor patterns, and advanced tabs below. I can try to mock something up if it isn't clear. The tabs that include the sequence number and identifier should be draggable into sequence position. It should also be possible to edit the sequence or identifier by double-clicking on them.

    Using this style would reduce scrolling and make it easier/faster to see what patterns you currently have. This is also another area that I think could benefit from some state color coding - say if the pattern is going to be called from a script.

  5. The Tree View interaction is quirky - You'd expect it to work like any Mac/Windows tree view, where you can left-click, drag, and drop an item into or onto another. Instead, it seems you must left-click (sometimes twice), wait for the script to load in the window, and then drag and drop. As far as I can tell, you can't drag and drop a folder at all.
  6. Code or Template Repository and script instances - I'm a single user, and rather new to the product, so perhaps what I'm envisioning wouldn't work well for most of your users, but it seems to me that a folder should limit the scope of what can be run to the scripts/sessions/scrapeable files within it (or its subfolders). Perhaps there could be another section above or below the tree view where you store your script/session/scrapeable file templates. You could then drag and drop an instance of a template into a folder for use. You could still have a file of the same name in a different folder, and changes to the template could be distributed to like-named copies if the user desired. I'd also like to be able to export a folder as a whole.
  7. Checkbox for HTML Tidy on/off default in Options->Settings->General - I thought there was one previously, but now I can only find it on each scrapeable file's Advanced tab. It would be nice to have a default setting in the main workbench settings window that could be overridden on each file if needed.
  8. Trash Token - I frequently find I have to make a trash token in an extractor pattern. I'd like to have a set token keyword like ~@TRASH@~ that would indicate that the item is to be matched but not included in the dataset. Or, as another option, instead of a keyword, have a checkbox option on the Token window for 'do not return to dataset' and let the user name it what they want.
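
To make suggestions 1 and 3 a bit more concrete, here is a minimal sketch of how tokens might be located in an extractor pattern so they could be highlighted or listed in a sidebar. It assumes tokens use the ~@NAME@~ delimiters from the example in suggestion 1; find_tokens and the sample pattern are hypothetical and are not part of the product.

```python
import re

# Assumption: tokens are written as ~@NAME@~, as in the example in suggestion 1.
TOKEN_RE = re.compile(r"~@(\w+)@~")


def find_tokens(pattern_text):
    """Return (name, start, end) for each token, in order of appearance.

    start and end are character offsets of the whole token, delimiters
    included, so an editor could color the span or jump the cursor to it.
    """
    return [(m.group(1), m.start(), m.end()) for m in TOKEN_RE.finditer(pattern_text)]


if __name__ == "__main__":
    # Hypothetical extractor pattern text.
    pattern = '<td class="name">~@NAME@~</td><td>~@ID@~</td>'
    for name, start, end in find_tokens(pattern):
        print(name, start, end)
```

A sidebar or dropdown could be filled from the returned names, and the offsets could be used to restore the cursor position after a token is saved.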

Also, I saw a reference that the web interface was in the professional edition via:
~#ss pro edition install dir#~\resource\lws\webapps\ROOT\
I don't see that directory in my files. If it's no longer available for the pro version, will it possibly be later? I would love to have that option.

Ok, I think that was probably enough to dump on you guys at one go! :)

Thanks again!


final note: when doing an initial preview of my post the following warning was thrown:

 warning: Invalid argument supplied for foreach() in /var/www/html/ on line 70.

Sheila,

It looks like you are likely using version 4.5, but if you were to update to a pre-release you'd see that we already have several of these things.

You can turn off tidy globally in the file by setting TidyHTML to false.

We once had a token that would work like the "TRASH" you suggest, but we found that not seeing it was clumsy as we needed to verify that it matched only what was expected. Now I just have it in the dataSet so I can confirm it isn't greedy. If you want, the "IGNORE" tag still works, but I honestly think you're better off without it.

We'll look at the others and get back ...

add one vote for trash token

I'd like to add my vote to this, though I'd prefer it to be either a checkbox option in the token dialog or, alternatively, a different token delimiter:
~$JUNK$~ instead of ~@JUNK@~

Perhaps adding a method to dataRecord and dataSet to clear trash tokens would be an alternative that still lets you access the trash values for testing while giving you an easy way to remove them. The test pattern dialog could then have a button to clear them out as well.

I currently have my own external class that clears any token that matches JUNK[\d]+ which works quite ok but I'd prefer an easy way to set junk status as an attribute.
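
For anyone wanting something similar, the idea can be sketched roughly as follows. This is not the dataRecord/dataSet API: plain dicts stand in for data records, and clear_junk and clear_junk_from_set are hypothetical helper names. Only the JUNK followed by digits naming rule comes from the description above.

```python
import re

# Token names that count as junk: JUNK followed by one or more digits.
JUNK_RE = re.compile(r"JUNK\d+")


def clear_junk(record, junk_re=JUNK_RE):
    """Return a copy of one record without the keys whose whole name matches junk_re."""
    return {key: value for key, value in record.items() if not junk_re.fullmatch(key)}


def clear_junk_from_set(records, junk_re=JUNK_RE):
    """Apply clear_junk to every record in a list of records."""
    return [clear_junk(record, junk_re) for record in records]


if __name__ == "__main__":
    # Hypothetical record: the junk values stay available until they are cleared.
    record = {"NAME": "Ann", "JUNK1": "<br>", "ID": "17", "JUNK2": "&nbsp;"}
    print(clear_junk(record))
```

Because the junk values are only removed on request, they remain visible while testing a pattern, which addresses the concern about not being able to check what a hidden token matched.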

As per another suggestion thread, I'd also love to see an option to insert regex directly, perhaps using an alternate token delimiter. Obviously, without a token name you won't be able to access the match, but it would be very handy for making patterns more flexible. E.g., I've come across quite a few spans that have random numbers of "&nbrsp;" entities (had to misspell that to make the entity visible in the post) around the data I want. Being able to do something like:
~$(&nbrsp;)[email protected]@~~$(&nbrsp;)+ would save a lot of data cleaning code in the scripts.
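
As an illustration of the matching behaviour being asked for, here is a rough sketch in plain Python regex rather than extractor pattern syntax. The span markup, the VALUE group name, and extract_value are all hypothetical; the point is only that unnamed groups can absorb a varying number of entities so the named match comes back clean.

```python
import re

# Unnamed groups absorb any run of &nbsp; entities or whitespace on either
# side, so only the named VALUE group holds the data.
VALUE_RE = re.compile(r"<span>(?:&nbsp;|\s)*(?P<VALUE>.*?)(?:&nbsp;|\s)*</span>")


def extract_value(html):
    """Return the span's text without surrounding &nbsp; padding, or None if no span matches."""
    match = VALUE_RE.search(html)
    return match.group("VALUE") if match else None


if __name__ == "__main__":
    print(extract_value("<span>&nbsp;&nbsp;42 Main St&nbsp;</span>"))
```

The sketch uses `*` rather than `+` so that spans with no padding at all still match.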

Thanks for the reply, Jason.

Are the pre-release versions generally stable enough to run in production? I'd be happy to change over if that is the case.

Thanks -


Most of the time they are, and on the occasion that we do introduce a problem, we get it fixed ASAP. We generally use the newest versions in production here, so we catch anything right away.