Trouble with scraping JSP and jQuery site
OK, I realize that this might be a dumb question, but here goes...
I'm scraping a site that looks like it uses JSP and jQuery as the UI for a database; when I set up a Proxy Server, I am able to grab the first page of the site (...index.jsp) with associated .jsp files, but subsequent pages captured by the Proxy show (I think) only the data sent back by the database in a ...select.jsp page.
The data sent back in the ...select.jsp file for subsequent queries looks like this:
{"responseHeader":{"status":0,"QTime":5178,"params":{"facet":"true","sort":"score desc,RD desc","f.AD.facet.mincount":"1","f.OO.facet.mincount":"1","f.ED.facet.date.other":"all","f.AD.facet.date.gap":"+1YEAR","type":"branddb","hl":"true","fl":"BRAND,BRAND_EN,BRAND_FR,BRAND_ES,BRAND_AR,SOURCE,STATUS,score,OO,HOL,HOL_EN,HOL_FR,HOL_ES,HOL_AR,RD,VCS,USC,NC,IMG,ID","f.ED.facet.date.end":"NOW/DAY+1YEAR","f.STATUS.facet.limit":"20","f.ED.facet.mincount":"1","facet.field":["SOURCE","STATUS","OO"],"f.AD.facet.date.end":"NOW/YEAR+1YEAR","fq":["SOURCE:CATM","STATUS:ACT","AD:([2013-01-01T00:00:00Z TO 2013-12-31T23:59:59Z] [2012-01-01T00:00:00Z TO 2012-12-31T23:59:59Z] [2011-01-01T00:00:00Z TO 2011-12-31T23:59:59Z] [2010-01-01T00:00:00Z TO 2010-12-31T23:59:59Z])"],"hl.requireFieldMatch":"true","hl.fragsize":"5000","f.AD.facet.date.start":"NOW/YEAR-2500YEAR","f.ED.facet.limit":"20","facet.date":["AD","ED"],"f.AD.facet.date.other":"all","f.STATUS.facet.mincount":"1","json.nl":"map","f.SOURCE.facet.mincount":"1","hl.fl":"HOL,HOL_EN,HOL_FR,HOL_ES,HOL_AR","f.SOURCE.facet.limit":"20","wt":"json","rows":"10","f.ED.facet.date.gap":"+1MONTH","start":"0","q":"HOL:\"church dwight\"","f.ED.facet.date.start":"NOW/DAY-1MONTH","f.AD.facet.limit":"200"}},"response":{"numFound":52,"start":0,"maxScore":8.708218,"docs":[{"HOL":["CHURCH & DWIGHT CO., INC."],"RD":"2013-05-13T21:59:59Z","STATUS":"ACT","SOURCE":"CATM","ID":"CATM.1475305-00","OO":"CA","BRAND":["GLOBRUSH"],"NC":[21],"score":8.708218},{"HOL":["CHURCH & DWIGHT CO., INC."],"RD":"2013-05-13T21:59:59Z","STATUS":"ACT","SOURCE":"CATM","ID":"CATM.1475635-00","OO":"CA","BRAND":["ESCAPE"],"NC":[3],"score":8.708218},{"HOL":["CHURCH & DWIGHT CO., INC."],"RD":"2013-04-26T21:59:59Z","STATUS":"ACT","SOURCE":"CATM","ID":"CATM.1562982-00","OO":"CA","BRAND":["FOR A CRAZY SEXY FEEL"],"NC":[5],"score":8.708218},{"HOL":["CHURCH & DWIGHT CO., INC."],"RD":"2013-04-08T21:59:59Z","STATUS":"ACT","SOURCE":"CATM","ID":"CATM.1566095-00","OO":"CA","BRAND":["MUSIQUE 
À BOUCHE"],"NC":[21],"score":8.708218},{"HOL":["CHURCH & DWIGHT CO., INC."],"RD":"2013-04-03T21:59:59Z","STATUS":"ACT","SOURCE":"CATM","ID":"CATM.1580169-00","OO":"CA","VCS":["26.04.04","26.04.05","26.04.18","26.04.24"],"BRAND":["ORAJEL"],"NC":[3,5],"IMG":"1580169","score":8.708218},{"HOL":["CHURCH & DWIGHT CO., INC."],"RD":"2013-03-14T22:59:59Z","STATUS":"ACT","SOURCE":"CATM","ID":"CATM.1473541-00","OO":"CA","BRAND":["LYSINE WITHOUT LIMITS"],"NC":[5,31],"score":8.708218},{"HOL":["Church & Dwight Co., Inc."],"RD":"2013-03-14T22:59:59Z","STATUS":"ACT","SOURCE":"CATM","ID":"CATM.1529080-00","OO":"CA","BRAND":["L'IL CRITTERS"],"NC":[5],"score":8.708218},{"HOL":["Church & Dwight Co., Inc."],"RD":"2013-03-14T22:59:59Z","STATUS":"ACT","SOURCE":"CATM","ID":"CATM.1529082-00","OO":"CA","VCS":["27.01.01","27.01.12"],"BRAND":["L'ILCRITTERS"],"NC":[5],"IMG":"1529082","score":8.708218},{"HOL":["CHURCH & DWIGHT CO., INC."],"RD":"2013-03-12T22:59:59Z","STATUS":"ACT","SOURCE":"CATM","ID":"CATM.1557093-00","OO":"CA","BRAND":["ONE DAY, ONE DOSE, HEALING BEGINS."],"NC":[5],"score":8.708218},{"HOL":["CHURCH & DWIGHT CO., INC."],"RD":"2013-02-19T22:59:59Z","STATUS":"ACT","SOURCE":"CATM","ID":"CATM.1467309-00","OO":"CA","VCS":["02.09.14","14.07.01","26.01.01","26.01.05","26.01.14","26.01.16","26.01.21","26.01.24"],"BRAND":["ARM & HAMMER THE STANDARD OF PURITY"],"NC":[20,24],"IMG":"1467309","score":8.708218}]},"facet_counts":{"facet_queries":{},"facet_fields":{"SOURCE":{"CATM":52},"STATUS":{"act":52},"OO":{"ca":52}},"facet_dates":{"AD":{"2010-01-01T00:00:00Z":26,"2011-01-01T00:00:00Z":23,"2012-01-01T00:00:00Z":3,"gap":"+1YEAR","start":"0488-01-01T00:00:00Z","end":"2014-01-01T00:00:00Z","before":0,"after":0,"between":52},"ED":{"gap":"+1MONTH","start":"2013-04-16T00:00:00Z","end":"2014-05-16T00:00:00Z","before":0,"after":0,"between":0}},"facet_ranges":{}},"highlighting":{"CATM.1475305-00":{"HOL":["<em>CHURCH</em> & <em>DWIGHT</em> CO., INC."]},"CATM.1475635-00":{"HOL":["<em>CHURCH</em> & 
<em>DWIGHT</em> CO., INC."]},"CATM.1562982-00":{"HOL":["<em>CHURCH</em> & <em>DWIGHT</em> CO., INC."]},"CATM.1566095-00":{"HOL":["<em>CHURCH</em> & <em>DWIGHT</em> CO., INC."]},"CATM.1580169-00":{"HOL":["<em>CHURCH</em> & <em>DWIGHT</em> CO., INC."]},"CATM.1473541-00":{"HOL":["<em>CHURCH</em> & <em>DWIGHT</em> CO., INC."]},"CATM.1529080-00":{"HOL":["<em>Church</em> & <em>Dwight</em> Co., Inc."]},"CATM.1529082-00":{"HOL":["<em>Church</em> & <em>Dwight</em> Co., Inc."]},"CATM.1557093-00":{"HOL":["<em>CHURCH</em> & <em>DWIGHT</em> CO., INC."]},"CATM.1467309-00":{"HOL":["<em>CHURCH</em> & <em>DWIGHT</em> CO., INC."]}}}
This output stumped me... I guess I could parse it using Java, but I was wondering if there's a recommended way to approach scraping sites like this one. Any advice here?
Thanks,
Justin
Personally, I like JSON data. It's more reliable than a page that might throw the unexpected at you. I often just make extractor patterns for it. You could make one like
Or there are lots of JSON libraries out there if you want to parse.
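The pattern from this reply isn't shown, so as a rough stand-in here is the same idea sketched with plain java.util.regex rather than a screen-scraper extractor pattern. It assumes, as in the sample output above, that "ID" comes before "BRAND" inside each record and that records contain no nested braces; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DocPatternSketch {
    // Captures the ID value, skips ahead without leaving the current {...} record,
    // then captures the first entry of the BRAND array.
    private static final Pattern DOC = Pattern.compile(
        "\"ID\":\"([^\"]+)\"[^}]*?\"BRAND\":\\[\"([^\"]+)\"");

    // Returns one {id, brand} pair per record that matches the pattern.
    static List<String[]> extract(String json) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = DOC.matcher(json);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2) });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two records trimmed from the sample response in the question.
        String json = "{\"response\":{\"numFound\":52,\"start\":0,\"docs\":["
            + "{\"HOL\":[\"CHURCH & DWIGHT CO., INC.\"],\"ID\":\"CATM.1475305-00\","
            + "\"OO\":\"CA\",\"BRAND\":[\"GLOBRUSH\"],\"NC\":[21]},"
            + "{\"HOL\":[\"CHURCH & DWIGHT CO., INC.\"],\"ID\":\"CATM.1475635-00\","
            + "\"OO\":\"CA\",\"BRAND\":[\"ESCAPE\"],\"NC\":[3]}]}}";
        for (String[] row : extract(json)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

A pattern like this is quick to write but brittle: it would miss brand names containing escaped quotes and any record where the key order differs, which is where a JSON library starts to pay off.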
Thanks - can you advise on how to implement a class?
Hi Jason,
Thanks for the advice. I have started creating extractor patterns as you suggested, but I'm also looking into JSON libraries, specifically Google's Gson library. AFAIK (and I don't know a lot), you have to create a class containing the elements you want to extract, for example:
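import java.util.List;

public class BrandInfo {
    // Field names must match the JSON keys. BRAND and HOL are JSON arrays
    // in the output above, so they are declared as lists rather than String.
    private List<String> BRAND; // This gets the value(s) for the Brand element
    private List<String> HOL;   // This gets the value(s) for the Hol(der) element
    // Getters and setters are not required for this example.
    // GSON sets the fields directly.
    @Override
    public String toString() {
        return BRAND + " - " + HOL;
    }
}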
You then use this class to extract information from the JSON output, like
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
Gson gson = new GsonBuilder().create();
BrandInfo b = gson.fromJson(<JSON STRING GOES HERE!>, BrandInfo.class);
session.logInfo(b);
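One thing to watch with a call like this: in the output shown earlier, each record sits inside response.docs, and BRAND and HOL are arrays. Mapping the whole string straight onto a single flat class would therefore most likely leave its fields null. Below is a minimal sketch of one way around that, assuming Gson is on the classpath. The wrapper class names (SolrResult, ResponseBody) are invented for illustration, and it runs as a standalone Java program rather than inside a screen-scraper script.

```java
import com.google.gson.Gson;
import java.util.List;

public class SolrResponseSketch {
    // One record from the "docs" array; field names mirror the JSON keys.
    static class BrandInfo {
        String ID;
        List<String> BRAND;
        List<String> HOL;
    }

    // The "response" object that wraps the records.
    static class ResponseBody {
        int numFound;
        List<BrandInfo> docs;
    }

    // The top-level object; keys not declared here are ignored by Gson.
    static class SolrResult {
        ResponseBody response;
    }

    static List<BrandInfo> parseDocs(String json) {
        SolrResult result = new Gson().fromJson(json, SolrResult.class);
        return result.response.docs;
    }

    public static void main(String[] args) {
        // Two records trimmed from the sample response.
        String json = "{\"response\":{\"numFound\":52,\"start\":0,\"docs\":["
            + "{\"HOL\":[\"CHURCH & DWIGHT CO., INC.\"],\"ID\":\"CATM.1475305-00\","
            + "\"BRAND\":[\"GLOBRUSH\"]},"
            + "{\"HOL\":[\"CHURCH & DWIGHT CO., INC.\"],\"ID\":\"CATM.1475635-00\","
            + "\"BRAND\":[\"ESCAPE\"]}]}}";
        for (BrandInfo b : parseDocs(json)) {
            System.out.println(b.ID + ": " + b.BRAND + " - " + b.HOL);
        }
    }
}
```

Only the fields declared in these classes are populated, so the facet and highlighting sections of the response can simply be left out.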
Now I vaguely remember that you or Scott had written up a tip about declaring classes within SS (vs. an external .jar), but
a) I can't find it and
b) I'm not sure it's applicable
Would you be able to point me in the right direction on these points?
Thanks in advance and hope you all have a great day!
Regards,
Justin