Resolving WstxUnexpectedCharException
I just ran into this exception when parsing news articles from the web:
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 19))
at [row,col {unknown-source}]: [1186,417]
at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace
at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary
at com.ctc.wstx.sr.BasicStreamReader.finishToken
at com.ctc.wstx.sr.BasicStreamReader.next
at org.codehaus.stax2.ri.Stax2EventReaderImpl.peek
The problem turned out to be a control character in one of the non-English articles. To solve it, simply remove the control characters from the text before parsing:
str.replaceAll("\\p{Cntrl}", "")
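Note that \p{Cntrl} also matches tab, line feed and carriage return, so the one-liner strips line breaks along with the offending character. If you want to keep those, here is a minimal sketch of a narrower filter. The class and method names (XmlSanitizer, stripIllegalControlChars) are made up for illustration, and I am assuming the text is sanitized as a String before it reaches the parser:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Woodstox or any other library.
final class XmlSanitizer {

    // C0 control characters that XML 1.0 does not allow in character data.
    // Tab (0x09), LF (0x0A) and CR (0x0D) are legal and are left alone.
    // DEL (0x7F) is legal in XML 1.0 but is dropped here too, as \p{Cntrl} does.
    private static final Pattern ILLEGAL_CONTROL_CHARS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]");

    private XmlSanitizer() {
    }

    // Returns the text with the control characters above removed.
    static String stripIllegalControlChars(String text) {
        return ILLEGAL_CONTROL_CHARS.matcher(text).replaceAll("");
    }
}
```

Call XmlSanitizer.stripIllegalControlChars(str) on the article text before handing it to the XML reader. Compiling the pattern once avoids recompiling the regex for every article, which str.replaceAll does on each call.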







2 comments:
Actually it should be replaceAll("[\\x00-\\x09\\x11\\x12\\x14-\\x1F\\x7F]", "") if you wish to keep the CR/LF
Should be:
xml.replaceAll("[\\x00-\\x09\\x0B\\x0C\\x0E-\\x1F\\x7F]", "")