504 Error only when scraping session run from cron
Hi.
A scraping session that is ready for production use works seamlessly on my local Mac. When I exported it and tried to run it on a Linux EC2 instance, this very strange thing happened:
If I remotely access the Linux server with ssh from a terminal window and run it directly with the following command, the scrape runs as it is supposed to, with no errors:
myscrapeshellscript.sh just contains:
jre/bin/java -jar screen-scraper.jar -s "My Scraping Session"
Now comes the weird part:
If I edit my crontab and put sh /home/myuser/screen-scraper_pro/myscrapeshellscript.sh to run at a specific time, I can see in the debug messages that at the start of the whole execution, on the first webpage that is scraped, I get:
.
Using proxy server: 127.0.0.1:64182
Sending request.
My scrape: Warning! Received a status code of: 504.
My scrape: Processing scripts before all pattern applications.
My scrape: Extracting data for pattern "Categories"
My scrape: The pattern did not find any matches.
.
.
.
(Everything else is run but of course with no data scraped)
This happens consistently, no matter what time I run the test. The scraping session runs perfectly if I run it manually, but I get the 504 error when it is scheduled to run from the crontab.
What could be the reason?
Thank you,
Boga
The only thing I can think of
The only thing I can think of is the user running the process. It looks like you're using a proxy, so if the user running the cron job can't use the proxy that would explain it. Otherwise I'm baffled.
But the user running the
But the user running the process is the same one whose crontab I am editing to schedule the scrape. I edit it with the command "crontab -e".
I tried with another simple
I tried with another simple scrape which basically connects to Tor & Polipo, scrapes one of those whatismyip websites to check the current IP address, then changes the Tor identity and checks the IP address again.
The same thing happens. It runs perfectly when run manually with sh, but when it runs on a schedule from my user's crontab, I get that same 505 error. This is the full debug output:
Starting scraper.
Running scraping session: Tor - Test change identity
Processing scripts before scraping session begins.
Processing script: "common - Tor & Polipo SHUTDOWN"
NO TOR CONTROLLER INITIALIZED
Processing script: "common - Tor & Polipo START"
TESTING PORT 58495
TESTING PORT 40717
STARTING TOR
tor -f tor/torrc --SocksPort 58495 --ControlPort 40717 --DataDirectory tor/Tor_Test_change_identity8270190755521 > tor/Tor_Test_change_identity8270190755521/tor.log
TESTING PORT 31520
STARTING HTTPPROXY POLIPO
polipo -c tor/polipo.conf socksParentProxy=127.0.0.1:58495 proxyPort=31520 logFile=tor/Tor_Test_change_identity8270190755521/polipo.log
Tor & Polipo started
Processing script: "Scrape what is my ip address"
Scraping file: "what is my ip address"
what is my ip address: Resolved URL: http://whatismyipaddress.com/
Using proxy server: 127.0.0.1:31520
what is my ip address: Sending request.
what is my ip address: Warning! Received a status code of: 504.
what is my ip address: Extracting data for pattern "my ip address"
what is my ip address: The pattern did not find any matches.
what is my ip address: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
My current ip address is: null
Processing script: "common - Tor new identity"
connecting to tor at port 40717
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(Unknown Source)
at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
at java.net.PlainSocketImpl.connect(Unknown Source)
at java.net.SocksSocketImpl.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at java.net.Socket.connect(Unknown Source)
at java.net.Socket.<init>(Unknown Source)
at java.net.Socket.<init>(Unknown Source)
at com.ryanjustus.sstorcontrol.SSTorController.requestNewIdentity(SSTorController.java:227)
at com.ryanjustus.sstorcontrol.SSTorController.requestNewIdentity(SSTorController.java:266)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at bsh.Reflect.invokeMethod(Unknown Source)
at bsh.Reflect.invokeObjectMethod(Unknown Source)
at bsh.Name.invokeMethod(Unknown Source)
at bsh.BSHMethodInvocation.eval(Unknown Source)
at bsh.BSHPrimaryExpression.eval(Unknown Source)
at bsh.Interpreter.eval(Unknown Source)
at bsh.Interpreter.eval(Unknown Source)
at bsh.Interpreter.eval(Unknown Source)
at com.screenscraper.scraper.ScriptContext$ScriptRunner.run(ScriptContext.java:352)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
New Tor identity acquired
Processing script: "Scrape what is my ip address"
Scraping file: "what is my ip address"
what is my ip address: Resolved URL: http://whatismyipaddress.com/
Setting referer to: http://whatismyipaddress.com/
Using proxy server: 127.0.0.1:31520
what is my ip address: Sending request.
what is my ip address: Warning! Received a status code of: 504.
what is my ip address: Extracting data for pattern "my ip address"
what is my ip address: The pattern did not find any matches.
what is my ip address: Warning! No matches were made by any of the extractor patterns associated with this scrapeable file.
My current ip address is: null
Processing scripts after scraping session has ended.
Processing script: "common - Tor & Polipo SHUTDOWN"
shutting down polipo
STOPPING HTTPPROXY POLIPO
shutting down tor
STOPPING TOR
everything shut down
Tor & Polipo shut down
Processing scripts always to be run at the end.
Scraping session "Tor - Test change identity" finished.
sorry, I meant 504
sorry, I meant 504 error.
boga.
More information, in case
More information, in case it's relevant: when I rebooted the EC2 instance and, first thing after the reboot, checked which processes were running with
ps aux | less
I could see these processes already running, and I am not sure they should be. I have now removed tor and polipo from the etc/init.d folder and rebooted, so that polipo and tor are no longer started automatically at boot.
However, I ran the test again and I keep getting the 504 error when running scripts through my user's cron.
When you 'run it manually',
When you 'run it manually', do you mean that you're running from the workbench?
Are you passing in any parameters when you run from the command line? Could you add some logging to your scrape to verify the variables are set as you expect?
by run it manually I mean run
By "run it manually" I mean running it on the Linux EC2 instance: I connect remotely from my Mac over ssh in a terminal window and run, from the command line, the same shell script that fails when it is run from that same user's cron. The start of this thread shows what that very simple shell script contains. As you can see, I am not passing any parameters.
So when I run
from the command line, it works. It scrapes the pages, grabs the data…inserts it in the database, etc…
but when I run it from the crontab:
on the first webpage scraped a 504 error is thrown and the rest of the script runs but of course without any extractor pattern match, without inserting anything in the database, etc...
As for logging, I am using the "debug" output, as you can see above in the thread. I have also tested with that other, simpler scrape whose only purpose is to check the IP address before and after changing Tor's identity and, as the debug output I pasted shows, the 504 error happens right after the first request for the website is sent.
Do you have anything in mind that could help pinpoint what's going on?
Thank you very much,
boga
Doing some googling I found
Doing some googling I found out that cron normally doesn't set the same environment variables as the shell, so cron was executing the script with a different PATH environment variable than the one in effect when I executed it manually from the command line.
The path set by the cron was:
/usr/bin:/bin
while the path set by the shell was:
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
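A quick way to see what such a PATH difference means in practice is to check which commands cannot be resolved on cron's shorter PATH. The sketch below is only an illustration: the helper name missing_on_path is made up, and it assumes (the thread does not confirm this) that the tor and polipo binaries, which the debug output shows being launched by bare name, live in a directory such as /usr/sbin or /usr/local/bin that cron's PATH does not include.

```shell
#!/bin/sh
# Hypothetical helper: print every command that cannot be found on a given PATH.
# Usage: missing_on_path "<PATH value>" cmd1 cmd2 ...
missing_on_path() {
    search_path=$1
    shift
    for cmd in "$@"; do
        # Do the lookup in a subshell so the caller's own PATH stays untouched.
        if ! ( PATH=$search_path; command -v "$cmd" >/dev/null 2>&1 ); then
            echo "$cmd"
        fi
    done
}

# cron's PATH as reported in this thread; any name printed here would not be
# found by a script started from cron with that PATH.
missing_on_path "/usr/bin:/bin" tor polipo
```

If this prints tor or polipo, the proxy was probably never started under cron, which would fit both the 504 from the proxy port and the "Connection refused" on the Tor control port.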
To fix this, I manually set PATH at the start of my user's crontab to match the shell's.
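As a rough sketch, the resulting crontab could look something like the fragment below. The PATH value and the script path are the ones from this thread; the 03:00 schedule is a made-up example, since the real schedule was never posted.

```
# PATH must be written out in full: cron does not expand $PATH on this line.
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

# m h dom mon dow  command   (hypothetical schedule: every day at 03:00)
0 3 * * * sh /home/myuser/screen-scraper_pro/myscrapeshellscript.sh
```

The assignment applies to every job listed below it in the same crontab.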
Now the cron execution of the script works also. No more 504 errors.
cheers,
boga