Thursday, April 09, 2009

resolving WstxUnexpectedCharException

Just ran into this exception when parsing news articles from the web:

Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: 
Illegal character ((CTRL-CHAR, code 19))
at [row,col {unknown-source}]: [1186,417]
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary
at com.ctc.wstx.sr.BasicStreamReader.finishToken
at com.ctc.wstx.sr.BasicStreamReader.next
at org.codehaus.stax2.ri.Stax2EventReaderImpl.peek
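
The problem turned out to be a control character (code 19, i.e. 0x13) in one of the non-English articles. XML 1.0 does not allow such characters, so the parser rejects the document. To solve the problem, simply remove the control characters from the text before parsing. Note that this also strips tabs and line breaks: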
str.replaceAll("\\p{Cntrl}", "")
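
For reference, here is a minimal sketch of the cleanup as a small helper. The class and method names (ControlCharCleaner, stripAllControls, stripIllegalControls) are made up for illustration and are not part of Woodstox or StAX. The second method is an assumed variant that keeps tab, LF and CR, which XML 1.0 permits, in case you need to preserve the layout of the text.

```java
import java.util.regex.Pattern;

public class ControlCharCleaner {
    // Same pattern as above: all ASCII control characters (0x00-0x1F and 0x7F).
    private static final Pattern ALL_CONTROLS = Pattern.compile("\\p{Cntrl}");

    // Assumed variant: every ASCII control character except tab (0x09),
    // LF (0x0A) and CR (0x0D). DEL (0x7F) is dropped as well.
    private static final Pattern ILLEGAL_CONTROLS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]");

    // Removes every control character, including tabs and line breaks.
    public static String stripAllControls(String text) {
        return ALL_CONTROLS.matcher(text).replaceAll("");
    }

    // Removes control characters but keeps tab, LF and CR.
    public static String stripIllegalControls(String text) {
        return ILLEGAL_CONTROLS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // \u0013 is the character (code 19) from the stack trace above.
        String dirty = "line one\r\n\tbad\u0013char";
        System.out.println(stripAllControls(dirty));
        System.out.println(stripIllegalControls(dirty));
    }
}
```

Run the cleanup on the raw string before handing it to the XML reader. Precompiling the patterns avoids recompiling the regex for every article, which is what String.replaceAll does on each call.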

2 comments:

Eishay Smith May 6, 2009 at 11:47 AM  

Actually it should be replaceAll("[\\x00-\\x09\\x11\\x12\\x14-\\x1F\\x7F]", "") if you wish to keep the CR/LF

Claudio Rossetto January 19, 2012 at 9:32 AM  

Should be:
xml.replaceAll("[\\x00-\\x09\\x0B\\x0C\\x0E-\\x1F\\x7F]", "")

Creative Commons License This work by Eishay Smith is licensed under a Creative Commons Attribution 3.0 Unported License.