Read multiline JSON in Apache Spark
up vote
11
down vote
favorite
I was trying to use a JSON file as a small DB. After creating a template table on DataFrame I queried it with SQL and got an exception. Here is my code:
val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")
val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()
df.printSchema() result:
root
|-- _corrupt_record: string (nullable = true)
My JSON file:
{
"id": 1,
"name": "Morty",
"age": 21
}
Exeption:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];
How can I fix it?
UPD
_corrupt_record is
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "id": 1,|
| "name": "Morty",|
| "age": 21|
| }|
+--------------------+
UPD2
It's weird, but when I rewrite my JSON to make it oneliner, everything works fine.
{"id": 1, "name": "Morty", "age": 21}
So the problem is in a newline.
UPD3
I found in docs the next sentence:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
It isn't convenient to keep JSON in such format. Is there any workaround to get rid of multi-lined structure of JSON or to convert it in oneliner?
json apache-spark apache-spark-sql
add a comment |
up vote
11
down vote
favorite
I was trying to use a JSON file as a small DB. After creating a template table on DataFrame I queried it with SQL and got an exception. Here is my code:
val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")
val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()
df.printSchema() result:
root
|-- _corrupt_record: string (nullable = true)
My JSON file:
{
"id": 1,
"name": "Morty",
"age": 21
}
Exeption:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];
How can I fix it?
UPD
_corrupt_record is
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "id": 1,|
| "name": "Morty",|
| "age": 21|
| }|
+--------------------+
UPD2
It's weird, but when I rewrite my JSON to make it oneliner, everything works fine.
{"id": 1, "name": "Morty", "age": 21}
So the problem is in a newline.
UPD3
I found in docs the next sentence:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
It isn't convenient to keep JSON in such format. Is there any workaround to get rid of multi-lined structure of JSON or to convert it in oneliner?
json apache-spark apache-spark-sql
add a comment |
up vote
11
down vote
favorite
up vote
11
down vote
favorite
I was trying to use a JSON file as a small DB. After creating a template table on DataFrame I queried it with SQL and got an exception. Here is my code:
val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")
val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()
df.printSchema() result:
root
|-- _corrupt_record: string (nullable = true)
My JSON file:
{
"id": 1,
"name": "Morty",
"age": 21
}
Exeption:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];
How can I fix it?
UPD
_corrupt_record is
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "id": 1,|
| "name": "Morty",|
| "age": 21|
| }|
+--------------------+
UPD2
It's weird, but when I rewrite my JSON to make it oneliner, everything works fine.
{"id": 1, "name": "Morty", "age": 21}
So the problem is in a newline.
UPD3
I found in docs the next sentence:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
It isn't convenient to keep JSON in such format. Is there any workaround to get rid of multi-lined structure of JSON or to convert it in oneliner?
json apache-spark apache-spark-sql
I was trying to use a JSON file as a small DB. After creating a template table on DataFrame I queried it with SQL and got an exception. Here is my code:
val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")
val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()
df.printSchema() result:
root
|-- _corrupt_record: string (nullable = true)
My JSON file:
{
"id": 1,
"name": "Morty",
"age": 21
}
Exeption:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];
How can I fix it?
UPD
_corrupt_record is
+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "id": 1,|
| "name": "Morty",|
| "age": 21|
| }|
+--------------------+
UPD2
It's weird, but when I rewrite my JSON to make it oneliner, everything works fine.
{"id": 1, "name": "Morty", "age": 21}
So the problem is in a newline.
UPD3
I found in docs the next sentence:
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
It isn't convenient to keep JSON in such format. Is there any workaround to get rid of multi-lined structure of JSON or to convert it in oneliner?
json apache-spark apache-spark-sql
json apache-spark apache-spark-sql
edited May 20 '17 at 9:30
zero323
159k39460562
159k39460562
asked Jul 23 '16 at 19:50
Finkelson
94431336
94431336
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
up vote
29
down vote
accepted
Spark >= 2.2
Spark 2.2 introduced wholeFilemultiLine option which can be used to load JSON (not JSONL) files:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
See:
SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
SPARK-20980 - Rename the optionwholeFiletomultiLinefor JSON and CSV.
Spark < 2.2
Well, using JSONL formated data may be inconvenient but it I will argue that is not the issue with API but the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
It provides no schema and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top level documents. Arguably this is the worst possible format to imagine to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.
That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())
add a comment |
up vote
4
down vote
Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).
Therefore, the correct syntax is now:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
This happened in https://issues.apache.org/jira/browse/SPARK-20980.
"multiline" option worked for me. Thanks Dan!
– Omkar
Feb 1 at 18:36
add a comment |
protected by Mohammad Usman May 11 at 11:56
Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).
Would you like to answer one of these unanswered questions instead?
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
29
down vote
accepted
Spark >= 2.2
Spark 2.2 introduced wholeFilemultiLine option which can be used to load JSON (not JSONL) files:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
See:
SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
SPARK-20980 - Rename the optionwholeFiletomultiLinefor JSON and CSV.
Spark < 2.2
Well, using JSONL formated data may be inconvenient but it I will argue that is not the issue with API but the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
It provides no schema and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top level documents. Arguably this is the worst possible format to imagine to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.
That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())
add a comment |
up vote
29
down vote
accepted
Spark >= 2.2
Spark 2.2 introduced wholeFilemultiLine option which can be used to load JSON (not JSONL) files:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
See:
SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
SPARK-20980 - Rename the optionwholeFiletomultiLinefor JSON and CSV.
Spark < 2.2
Well, using JSONL formated data may be inconvenient but it I will argue that is not the issue with API but the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
It provides no schema and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top level documents. Arguably this is the worst possible format to imagine to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.
That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())
add a comment |
up vote
29
down vote
accepted
up vote
29
down vote
accepted
Spark >= 2.2
Spark 2.2 introduced wholeFilemultiLine option which can be used to load JSON (not JSONL) files:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
See:
SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
SPARK-20980 - Rename the optionwholeFiletomultiLinefor JSON and CSV.
Spark < 2.2
Well, using JSONL formated data may be inconvenient but it I will argue that is not the issue with API but the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
It provides no schema and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top level documents. Arguably this is the worst possible format to imagine to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.
That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())
Spark >= 2.2
Spark 2.2 introduced wholeFilemultiLine option which can be used to load JSON (not JSONL) files:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
See:
SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).
SPARK-20980 - Rename the optionwholeFiletomultiLinefor JSON and CSV.
Spark < 2.2
Well, using JSONL formated data may be inconvenient but it I will argue that is not the issue with API but the format itself. JSON is simply not designed to be processed in parallel in distributed systems.
It provides no schema and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top level documents. Arguably this is the worst possible format to imagine to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.
That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:
spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())
edited Jul 23 at 16:39
Community♦
11
11
answered Jul 24 '16 at 12:42
zero323
159k39460562
159k39460562
add a comment |
add a comment |
up vote
4
down vote
Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).
Therefore, the correct syntax is now:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
This happened in https://issues.apache.org/jira/browse/SPARK-20980.
"multiline" option worked for me. Thanks Dan!
– Omkar
Feb 1 at 18:36
add a comment |
up vote
4
down vote
Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).
Therefore, the correct syntax is now:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
This happened in https://issues.apache.org/jira/browse/SPARK-20980.
"multiline" option worked for me. Thanks Dan!
– Omkar
Feb 1 at 18:36
add a comment |
up vote
4
down vote
up vote
4
down vote
Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).
Therefore, the correct syntax is now:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
This happened in https://issues.apache.org/jira/browse/SPARK-20980.
Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).
Therefore, the correct syntax is now:
spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/path/to/user.json")
This happened in https://issues.apache.org/jira/browse/SPARK-20980.
answered Dec 29 '17 at 15:42
Dan Coates
917
917
"multiline" option worked for me. Thanks Dan!
– Omkar
Feb 1 at 18:36
add a comment |
"multiline" option worked for me. Thanks Dan!
– Omkar
Feb 1 at 18:36
"multiline" option worked for me. Thanks Dan!
– Omkar
Feb 1 at 18:36
"multiline" option worked for me. Thanks Dan!
– Omkar
Feb 1 at 18:36
add a comment |
protected by Mohammad Usman May 11 at 11:56
Thank you for your interest in this question.
Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).
Would you like to answer one of these unanswered questions instead?