I am working on a project that hits multiple URLs and scrapes each page every minute (using a thread pool). I am using HtmlUnit 2.20 for this. However, its timeout functionality does not give me what I need.
Firstly, I am using WebClient.setTimeout(), whose value is applied twice: once for the socket connection and again for data retrieval. HtmlUnit opens a new socket connection to fetch each individual .js script, so the timeout applies per connection, which defeats my purpose of setting one overall timeout for the whole connect-and-fetch process for a page. Secondly, I am also using webClient.setJavaScriptTimeout(), but this sets the execution timeout for each individual script, again defeating my purpose of an overall JavaScript timeout. I have also tried webClient.waitForBackgroundJavaScript(), but it does not stop JavaScript execution even after the specified timeout has elapsed.
Can someone help me set an overall timeout for JavaScript execution? I have also tried using Java's Future interface to put a timeout on the page fetch, but the drawback is that it needs a child thread for the execution. The service itself already runs on a thread pool, with each URL fetched on a different thread. If I create a child thread for each thread in the pool, memory consumption will increase. Any thoughts on this, or on how to measure the increase in memory consumption, would be really helpful. Most of the URLs in our Beta environment are stale, so I cannot test this properly there. Any help is appreciated.
```java
public String loadDocument(String url) {
    try {
        String content;
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        // Per-script execution limit in milliseconds; set before any script runs.
        webClient.setJavaScriptTimeout(JS_TIMEOUT);
        request = new WebRequest(new URL(url));
        response = webClient.loadWebResponse(request);
        // Parse the response and run its JavaScript on a child thread so that
        // the whole step can be bounded by future.get(...) below.
        Callable<Page> task = () -> {
            Page loaded = webClient.loadWebResponseInto(response, webClient.getCurrentWindow());
            webClient.waitForBackgroundJavaScript(JS_TIMEOUT);
            return loaded;
        };
        future = executor.submit(task);
        log.info("Thread count after submitting task " + Thread.activeCount());
        // JS_TIMEOUT is assumed to be in milliseconds, matching the WebClient calls above.
        page = future.get(JS_TIMEOUT, TimeUnit.MILLISECONDS);
        if (page instanceof TextPage) {
            content = ((TextPage) page).getContent();
        } else {
            content = ((HtmlPage) page).asXml();
        }
        log.info("Successfully retrieved content");
        if (log.isTraceEnabled()) {
            log.trace("Response content:\n " + content);
        }
        return content;
    } catch (TimeoutException e) {
        log.info("Timed out");
        // Interrupt the child thread and fall back to the raw response body.
        future.cancel(true);
        return response.getContentAsString();
    } catch (Exception e) {
        log.error("Failed to retrieve content because of exception " + e.getCause(), e);
        return null;
    } finally {
        log.info("Closing web client");
        log.info("Heap memory used (KB): "
                + ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() / 1024);
        webClient.close();
        executor.shutdown();
    }
}
```
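
For anyone who wants to reproduce the Future-based approach and the thread/heap measurement without HtmlUnit, here is a minimal JDK-only sketch. The names `OverallTimeoutSketch`, `callWithDeadline`, `usageSnapshot` and `LOADERS` are hypothetical, and the `Callable` stands in for the page load. It assumes one shared cached pool, so idle worker threads are reused rather than created once per URL. Note that `future.cancel(true)` only interrupts the worker; whether HtmlUnit's JavaScript execution reacts to that interrupt is not something this sketch demonstrates.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class OverallTimeoutSketch {

    // Hypothetical shared pool of daemon threads, reused across all fetches.
    private static final ExecutorService LOADERS = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "page-loader");
        t.setDaemon(true);
        return t;
    });

    // Runs the task with one overall deadline; returns the fallback on timeout.
    static <T> T callWithDeadline(Callable<T> task, long timeoutMillis, T fallback) throws Exception {
        Future<T> future = LOADERS.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; it is not forcibly stopped
            return fallback;
        }
    }

    // Live JVM thread count and used heap, for before/after comparisons.
    static String usageSnapshot() {
        int liveThreads = ManagementFactory.getThreadMXBean().getThreadCount();
        long heapKb = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() / 1024;
        return "liveThreads=" + liveThreads + " heapUsedKB=" + heapKb;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("before: " + usageSnapshot());
        String fast = callWithDeadline(() -> "loaded", 500, "timed out");
        String slow = callWithDeadline(() -> { Thread.sleep(5_000); return "loaded"; }, 200, "timed out");
        System.out.println(fast + " / " + slow); // prints "loaded / timed out"
        System.out.println("after: " + usageSnapshot());
    }
}
```

Logging `usageSnapshot()` before and after a batch of fetches gives a rough view of how many extra threads the child tasks add. The heap figure is noisy because of garbage collection, so it is best compared over many runs.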