I recently upgraded HtmlUnit to 2.35 and then 2.36; I use it as a crawler. This did not happen with 2.33. After crawling a large number of pages, I see a growing number of threads like the following in the thread dump. Eventually they exhaust system resources and hang the container (Docker).
"WebSocketClient@1404341633-126276" #126276 daemon prio=5 os_prio=0 tid=0x00007faba4228800 nid=0xbe7 runnable [0x00007fa1ec102000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x00007fba17c3b6f0> (a sun.nio.ch.Util$3)
- locked <0x00007fba17c3b6d8> (a java.util.Collections$UnmodifiableSet)
- locked <0x00007fba17c3b528> (a sun.nio.ch.EPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:464)
at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:401)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produceTask(EatWhatYouKill.java:357)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:181)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:132)
at org.eclipse.jetty.io.ManagedSelector$$Lambda$71/1759199424.run(Unknown Source)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:786)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:743)
at java.lang.Thread.run(Thread.java:748)
"WebSocketClient@1404341633-126275" #126275 daemon prio=5 os_prio=0 tid=0x00007faba403b000 nid=0xbdb waiting on condition [0x00007fa28901c000]
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00007fba17d68fd8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:727)
at java.lang.Thread.run(Thread.java:748)
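To confirm that the number of these threads keeps growing, the live threads can be counted per pool name from inside the JVM. This is only a minimal sketch: the class and method names (`ThreadPoolCounter`, `poolName`, `countByPool`) are hypothetical and not part of HtmlUnit or Jetty, and it assumes the pool threads are named `<pool>-<number>` as in the dump above.

```java
import java.util.Map;
import java.util.TreeMap;

public class ThreadPoolCounter {
    // Strips a trailing "-<digits>" so that "WebSocketClient@1404341633-126276"
    // is grouped under "WebSocketClient@1404341633".
    static String poolName(String threadName) {
        return threadName.replaceFirst("-\\d+$", "");
    }

    // Counts the given threads per pool name.
    static Map<String, Integer> countByPool(Iterable<Thread> threads) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : threads) {
            counts.merge(poolName(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Thread.getAllStackTraces() returns a snapshot of all live threads.
        countByPool(Thread.getAllStackTraces().keySet())
                .forEach((pool, n) -> System.out.println(n + "\t" + pool));
    }
}
```

Logging this periodically during a crawl should show whether the `WebSocketClient@...` groups are released after each page or keep accumulating.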
The calling code looks like the following. Because some sites can be very slow to respond, I run the crawl as a future task and cancel it in the finally block.
.....
ExecutorService executor = Executors.newSingleThreadExecutor();
Future future = executor.submit(new HtmlunitCrawl(urlWithProto, timeout, useProxy));
.....
} finally {
future.cancel(true);
executor.shutdownNow();
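For reference, the submit-then-cancel pattern can be sketched as a self-contained example. The elided parts of my code are not reproduced here; `TimedTask`, `runWithTimeout` and the fallback value are hypothetical names used only for illustration, and a sleeping task stands in for the crawl.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedTask {
    // Runs the task on its own single-thread executor and waits at most
    // timeoutMillis; returns fallback if the task fails or does not finish in time.
    static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return fallback;
        } finally {
            future.cancel(true);    // interrupts the worker thread if it is still running
            executor.shutdownNow(); // releases the executor's own thread
        }
    }

    public static void main(String[] args) {
        // A slow task standing in for the crawl.
        String result = runWithTimeout(() -> {
            Thread.sleep(5_000);
            return "finished";
        }, 100, "timed out");
        System.out.println(result); // prints "timed out"
    }
}
```

Note that `future.cancel(true)` and `shutdownNow()` only interrupt the executor's worker thread. Threads that a library starts on its own from inside the task are not stopped by this and have to be released by the library's own cleanup.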
The crawling code:
java.util.logging.Logger.getLogger("com.gargoylesoftware.htmlunit").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
java.util.logging.Logger.getLogger("org.apache.commons.httpclient").setLevel(Level.OFF);
WebClient webClient = new WebClient(BROWSER_VERSION);
webClient.getOptions().setTimeout(timeout*1000);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setPopupBlockerEnabled(true);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setRedirectEnabled(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setPrintContentOnFailingStatusCode(false);
webClient.setJavaScriptTimeout(sJavascriptTimeout*1000);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.waitForBackgroundJavaScript(sJavascriptTimeout*1000);
webClient.setScriptPreProcessor(new PandoraHtmlunitScriptPreprocessor());
// webClient.setRefreshHandler(new ThreadedRefreshHandler());
webClient.setRefreshHandler(new WaitingRefreshHandler(timeout));
.....
PandoraWebConnection conn = new PandoraWebConnection(webClient);
webClient.setWebConnection(conn);
RedirectChain rc = new RedirectChain();
rc.entryUrl = url;
PandoraWebConnection.REDIRECT_TABLE.put(Thread.currentThread().getName(), rc);
**aPage = webClient.getPage(urlWithProto);**
Thread.sleep(sCrawlerWaitTime*1000);
crawlResp = conn.getLastResponse();
// System.out.println("Sleeping 6 seconds for page to fully load");
// Thread.sleep(6000);
PandoraHtmlunitScriptPreprocessor.RUNNING_SCRIPTS.remove(Thread.currentThread().getName());
} finally {
try {
**webClient.close();**
} catch (Exception ex) {
ex.printStackTrace();
}
}