Oddity with ReadDelimitedText's notion of newlines?
My environment is admittedly confusing: I'm using a Windows machine, but doing all my development via Cygwin. Mapping between Windows and Unix path delimiters and newlines seems to be troublesome.
I have downloaded a CSV dataset that uses Windows newline conventions and written an incredibly simple DataRush application that just reads in the file using the ReadDelimitedText operator. Upon attempting to run it under Cygwin using Ant, I receive:
test:
[junit] Running ds2toderby.Ds2ToDerbyTestCase
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.863 sec
[junit] Testsuite: ds2toderby.Ds2ToDerbyTestCase
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.863 sec
[junit] ------------- Standard Error -----------------
[junit] 2006-11-14 13:48:58.768 INFO ds2toderby.Ds2ToDerbyTestCase.runJob -- run application ds2toderby.Ds2ToDerby using ds2toderby.properties
[junit] ------------- ---------------- ---------------
[junit] Testcase: testDs2ToDerby took 1.863 sec
[junit] Caused an ERROR
[junit] Ds2ToDerby
[junit] com.pervasive.dataflow.dev.DFCompositeException: Ds2ToDerby{Ds2ToDerby.readFlatFile.parse=java.lang.RuntimeException: parse error, position in file: 1529726, line number: 1, position in line: 1529726}
[junit] Ds2ToDerby.readFlatFile.parse
[junit] java.lang.RuntimeException: parse error, position in file: 1529726, line number: 1, position in line: 1529726
[junit] at com.pervasive.dataflow.operators.io.textfile.ParseDelimitedFieldsProcess.run(ParseDelimitedFieldsProcess.java:109)
[junit] at com.pervasive.dataflow.executor.OperatorTask.runTask(OperatorTask.java:192)
[junit] at com.pervasive.dataflow.executor.Task.run(Task.java:93)
[junit] Caused by: java.text.ParseException: record arity = 190001, expected arity = 20
[junit] at com.pervasive.dataflow.operators.io.textfile.ParseDelimitedFieldsProcess.parseFields(ParseDelimitedFieldsProcess.java:262)
[junit] at com.pervasive.dataflow.operators.io.textfile.ParseDelimitedFieldsProcess.run(ParseDelimitedFieldsProcess.java:106)
[junit] ... 2 more
However, if I set the "recordSeparator" attribute for the ReadDelimitedText operator as follows:
<Property name="recordSeparator" type="string" default="\n">
<Description>
Character or characters demarcating the end of a line
</Description>
</Property>
(that is, overriding the default with "\n"), then all is well:
test:
[junit] Running ds2toderby.Ds2ToDerbyTestCase
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.654 sec
[junit] Testsuite: ds2toderby.Ds2ToDerbyTestCase
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.654 sec
[junit] ------------- Standard Error -----------------
[junit] 2006-11-14 13:32:04.265 INFO ds2toderby.Ds2ToDerbyTestCase.runJob -- run application ds2toderby.Ds2ToDerby using ds2toderby.properties
[junit] 2006-11-14 13:32:06.909 INFO Ds2ToDerby.readFlatFile.parse.run total record count: 10000
[junit] ------------- ---------------- ---------------
[junit] Testcase: testDs2ToDerby took 2.654 sec
Setting default="" causes the same error as above. Apparently, the operating system's default newline value is not being substituted by the assembly as claimed in the documentation: "recordSeparator -- Character sequence identifying line/record break. Default is blank, filled in by assembly with system-defined line separator."
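I am inferring this by noting that the original test attempts to parse the entire file as a single line, while using the overridden default causes the operator to correctly identify 10k distinct records separated by newlines. The reported arity is consistent with that reading: 10,000 records of 20 fields contain 19 delimiters each, so a file read as one record has 190,000 delimiters and therefore 190,001 fields.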
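One way to test this inference independently of DataRush is to count the line terminators actually present in the file. The sketch below is a plain-Java diagnostic, not part of the DataRush API; the class name `NewlineCounter` and the command-line argument are illustrative assumptions. It reports how many CRLF pairs, bare LF, and bare CR bytes the file contains.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; not part of DataRush.
final class NewlineCounter {

    /** Returns counts as {CRLF pairs, bare LF, bare CR}. */
    static long[] count(byte[] data) {
        long crlf = 0, lf = 0, cr = 0;
        int prev = -1;
        for (byte raw : data) {
            int b = raw & 0xFF;
            if (b == '\n') {
                if (prev == '\r') {
                    crlf++;   // the preceding CR belongs to a CRLF pair
                    cr--;     // so undo its provisional "bare CR" count
                } else {
                    lf++;
                }
            } else if (b == '\r') {
                cr++;         // provisionally bare until we see what follows
            }
            prev = b;
        }
        return new long[] {crlf, lf, cr};
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the CSV file under test.
        long[] c = count(Files.readAllBytes(Paths.get(args[0])));
        System.out.println("CRLF: " + c[0] + ", bare LF: " + c[1] + ", bare CR: " + c[2]);
    }
}
```

If the file really uses Windows conventions, the CRLF count should be close to the record count and the other two counts should be zero; a different result would suggest the file was converted somewhere along the way (for example by a Cygwin text-mode mount).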
Could this be an issue in ReadDelimitedText's interpretation of the system-defined line separator?
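To narrow this down, it helps to see what the JVM itself reports as the system line separator when launched from the same Cygwin shell. The following is a minimal sketch using only the standard `line.separator` system property; the class name `LineSeparatorCheck` is made up for illustration, and whether DataRush consults this property at all is an assumption, not something the documentation quoted above confirms.

```java
// Hypothetical diagnostic; prints the JVM's line separator with
// control characters made visible.
final class LineSeparatorCheck {

    /** Replaces CR and LF with the visible escapes \r and \n. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\r': sb.append("\\r"); break;
                case '\n': sb.append("\\n"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sep = System.getProperty("line.separator");
        System.out.println("line.separator = \"" + escape(sep) + "\"");
    }
}
```

Running this from the same Ant target as the test shows which value a blank recordSeparator would be filled in with, if the substitution described in the documentation takes place. Comparing it against the terminators actually found in the file indicates whether the problem lies in the substitution or in a mismatch between the file and the platform default.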




