Read multiline JSON in Apache Spark











up vote
11
down vote

favorite
7












I was trying to use a JSON file as a small DB. After creating a template table on DataFrame I queried it with SQL and got an exception. Here is my code:



val df = sqlCtx.read.json("/path/to/user.json")
df.registerTempTable("user_tt")

val info = sqlCtx.sql("SELECT name FROM user_tt")
info.show()


df.printSchema() result:



root
|-- _corrupt_record: string (nullable = true)


My JSON file:



{
"id": 1,
"name": "Morty",
"age": 21
}


Exeption:



Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];


How can I fix it?



UPD



_corrupt_record is



+--------------------+
| _corrupt_record|
+--------------------+
| {|
| "id": 1,|
| "name": "Morty",|
| "age": 21|
| }|
+--------------------+


UPD2



It's weird, but when I rewrite my JSON to make it oneliner, everything works fine.



{"id": 1, "name": "Morty", "age": 21}


So the problem is in a newline.



UPD3



I found in docs the next sentence:




Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.




It isn't convenient to keep JSON in such format. Is there any workaround to get rid of multi-lined structure of JSON or to convert it in oneliner?










share|improve this question




























    up vote
    11
    down vote

    favorite
    7












    I was trying to use a JSON file as a small DB. After creating a template table on DataFrame I queried it with SQL and got an exception. Here is my code:



    val df = sqlCtx.read.json("/path/to/user.json")
    df.registerTempTable("user_tt")

    val info = sqlCtx.sql("SELECT name FROM user_tt")
    info.show()


    df.printSchema() result:



    root
    |-- _corrupt_record: string (nullable = true)


    My JSON file:



    {
    "id": 1,
    "name": "Morty",
    "age": 21
    }


    Exeption:



    Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];


    How can I fix it?



    UPD



    _corrupt_record is



    +--------------------+
    | _corrupt_record|
    +--------------------+
    | {|
    | "id": 1,|
    | "name": "Morty",|
    | "age": 21|
    | }|
    +--------------------+


    UPD2



    It's weird, but when I rewrite my JSON to make it oneliner, everything works fine.



    {"id": 1, "name": "Morty", "age": 21}


    So the problem is in a newline.



    UPD3



    I found in docs the next sentence:




    Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.




    It isn't convenient to keep JSON in such format. Is there any workaround to get rid of multi-lined structure of JSON or to convert it in oneliner?










    share|improve this question


























      up vote
      11
      down vote

      favorite
      7









      up vote
      11
      down vote

      favorite
      7






      7





      I was trying to use a JSON file as a small DB. After creating a template table on DataFrame I queried it with SQL and got an exception. Here is my code:



      val df = sqlCtx.read.json("/path/to/user.json")
      df.registerTempTable("user_tt")

      val info = sqlCtx.sql("SELECT name FROM user_tt")
      info.show()


      df.printSchema() result:



      root
      |-- _corrupt_record: string (nullable = true)


      My JSON file:



      {
      "id": 1,
      "name": "Morty",
      "age": 21
      }


      Exeption:



      Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];


      How can I fix it?



      UPD



      _corrupt_record is



      +--------------------+
      | _corrupt_record|
      +--------------------+
      | {|
      | "id": 1,|
      | "name": "Morty",|
      | "age": 21|
      | }|
      +--------------------+


      UPD2



      It's weird, but when I rewrite my JSON to make it oneliner, everything works fine.



      {"id": 1, "name": "Morty", "age": 21}


      So the problem is in a newline.



      UPD3



      I found in docs the next sentence:




      Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.




      It isn't convenient to keep JSON in such format. Is there any workaround to get rid of multi-lined structure of JSON or to convert it in oneliner?










      share|improve this question















      I was trying to use a JSON file as a small DB. After creating a template table on DataFrame I queried it with SQL and got an exception. Here is my code:



      val df = sqlCtx.read.json("/path/to/user.json")
      df.registerTempTable("user_tt")

      val info = sqlCtx.sql("SELECT name FROM user_tt")
      info.show()


      df.printSchema() result:



      root
      |-- _corrupt_record: string (nullable = true)


      My JSON file:



      {
      "id": 1,
      "name": "Morty",
      "age": 21
      }


      Exeption:



      Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'name' given input columns: [_corrupt_record];


      How can I fix it?



      UPD



      _corrupt_record is



      +--------------------+
      | _corrupt_record|
      +--------------------+
      | {|
      | "id": 1,|
      | "name": "Morty",|
      | "age": 21|
      | }|
      +--------------------+


      UPD2



      It's weird, but when I rewrite my JSON to make it oneliner, everything works fine.



      {"id": 1, "name": "Morty", "age": 21}


      So the problem is in a newline.



      UPD3



      I found in docs the next sentence:




      Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.




      It isn't convenient to keep JSON in such format. Is there any workaround to get rid of multi-lined structure of JSON or to convert it in oneliner?







      json apache-spark apache-spark-sql






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited May 20 '17 at 9:30









      zero323

      159k39460562




      159k39460562










      asked Jul 23 '16 at 19:50









      Finkelson

      94431336




      94431336
























          2 Answers
          2






          active

          oldest

          votes

















          up vote
          29
          down vote



          accepted










          Spark >= 2.2



          Spark 2.2 introduced wholeFile multiLine option which can be used to load JSON (not JSONL) files:



          spark.read
          .option("multiLine", true).option("mode", "PERMISSIVE")
          .json("/path/to/user.json")


          See:





          • SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).


          • SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.


          Spark < 2.2



          Well, using JSONL formated data may be inconvenient but it I will argue that is not the issue with API but the format itself. JSON is simply not designed to be processed in parallel in distributed systems.



          It provides no schema and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top level documents. Arguably this is the worst possible format to imagine to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.



          That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:



          spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())





          share|improve this answer






























            up vote
            4
            down vote













            Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).



            Therefore, the correct syntax is now:



            spark.read
            .option("multiLine", true).option("mode", "PERMISSIVE")
            .json("/path/to/user.json")


            This happened in https://issues.apache.org/jira/browse/SPARK-20980.






            share|improve this answer





















            • "multiline" option worked for me. Thanks Dan!
              – Omkar
              Feb 1 at 18:36










            protected by Mohammad Usman May 11 at 11:56



            Thank you for your interest in this question.
            Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



            Would you like to answer one of these unanswered questions instead?














            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            29
            down vote



            accepted










            Spark >= 2.2



            Spark 2.2 introduced wholeFile multiLine option which can be used to load JSON (not JSONL) files:



            spark.read
            .option("multiLine", true).option("mode", "PERMISSIVE")
            .json("/path/to/user.json")


            See:





            • SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).


            • SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.


            Spark < 2.2



            Well, using JSONL formated data may be inconvenient but it I will argue that is not the issue with API but the format itself. JSON is simply not designed to be processed in parallel in distributed systems.



            It provides no schema and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top level documents. Arguably this is the worst possible format to imagine to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.



            That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:



            spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())





            share|improve this answer



























              up vote
              29
              down vote



              accepted










              Spark >= 2.2



              Spark 2.2 introduced wholeFile multiLine option which can be used to load JSON (not JSONL) files:



              spark.read
              .option("multiLine", true).option("mode", "PERMISSIVE")
              .json("/path/to/user.json")


              See:





              • SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).


              • SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.


              Spark < 2.2



              Well, using JSONL formated data may be inconvenient but it I will argue that is not the issue with API but the format itself. JSON is simply not designed to be processed in parallel in distributed systems.



              It provides no schema and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top level documents. Arguably this is the worst possible format to imagine to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.



              That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:



              spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())





              share|improve this answer

























                up vote
                29
                down vote



                accepted







                up vote
                29
                down vote



                accepted






                Spark >= 2.2



                Spark 2.2 introduced wholeFile multiLine option which can be used to load JSON (not JSONL) files:



                spark.read
                .option("multiLine", true).option("mode", "PERMISSIVE")
                .json("/path/to/user.json")


                See:





                • SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).


                • SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.


                Spark < 2.2



                Well, using JSONL formated data may be inconvenient but it I will argue that is not the issue with API but the format itself. JSON is simply not designed to be processed in parallel in distributed systems.



                It provides no schema and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top level documents. Arguably this is the worst possible format to imagine to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.



                That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:



                spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())





                share|improve this answer














                Spark >= 2.2



                Spark 2.2 introduced wholeFile multiLine option which can be used to load JSON (not JSONL) files:



                spark.read
                .option("multiLine", true).option("mode", "PERMISSIVE")
                .json("/path/to/user.json")


                See:





                • SPARK-18352 - Parse normal, multi-line JSON files (not just JSON Lines).


                • SPARK-20980 - Rename the option wholeFile to multiLine for JSON and CSV.


                Spark < 2.2



                Well, using JSONL formated data may be inconvenient but it I will argue that is not the issue with API but the format itself. JSON is simply not designed to be processed in parallel in distributed systems.



                It provides no schema and without making some very specific assumptions about its formatting and shape it is almost impossible to correctly identify top level documents. Arguably this is the worst possible format to imagine to use in systems like Apache Spark. It is also quite tricky and typically impractical to write valid JSON in distributed systems.



                That being said, if individual files are valid JSON documents (either single document or an array of documents) you can always try wholeTextFiles:



                spark.read.json(sc.wholeTextFiles("/path/to/user.json").values())






                share|improve this answer














                share|improve this answer



                share|improve this answer








                edited Jul 23 at 16:39









                Community

                11




                11










                answered Jul 24 '16 at 12:42









                zero323

                159k39460562




                159k39460562
























                    up vote
                    4
                    down vote













                    Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).



                    Therefore, the correct syntax is now:



                    spark.read
                    .option("multiLine", true).option("mode", "PERMISSIVE")
                    .json("/path/to/user.json")


                    This happened in https://issues.apache.org/jira/browse/SPARK-20980.






                    share|improve this answer





















                    • "multiline" option worked for me. Thanks Dan!
                      – Omkar
                      Feb 1 at 18:36















                    up vote
                    4
                    down vote













                    Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).



                    Therefore, the correct syntax is now:



                    spark.read
                    .option("multiLine", true).option("mode", "PERMISSIVE")
                    .json("/path/to/user.json")


                    This happened in https://issues.apache.org/jira/browse/SPARK-20980.






                    share|improve this answer





















                    • "multiline" option worked for me. Thanks Dan!
                      – Omkar
                      Feb 1 at 18:36













                    up vote
                    4
                    down vote










                    up vote
                    4
                    down vote









                    Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).



                    Therefore, the correct syntax is now:



                    spark.read
                    .option("multiLine", true).option("mode", "PERMISSIVE")
                    .json("/path/to/user.json")


                    This happened in https://issues.apache.org/jira/browse/SPARK-20980.






                    share|improve this answer












                    Just to add on to zero323's answer, the option in Spark 2.2+ to read multi-line JSON was renamed to multiLine (see the Spark documentation here).



                    Therefore, the correct syntax is now:



                    spark.read
                    .option("multiLine", true).option("mode", "PERMISSIVE")
                    .json("/path/to/user.json")


                    This happened in https://issues.apache.org/jira/browse/SPARK-20980.







                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Dec 29 '17 at 15:42









                    Dan Coates

                    917




                    917












                    • "multiline" option worked for me. Thanks Dan!
                      – Omkar
                      Feb 1 at 18:36


















                    • "multiline" option worked for me. Thanks Dan!
                      – Omkar
                      Feb 1 at 18:36
















                    "multiline" option worked for me. Thanks Dan!
                    – Omkar
                    Feb 1 at 18:36




                    "multiline" option worked for me. Thanks Dan!
                    – Omkar
                    Feb 1 at 18:36





                    protected by Mohammad Usman May 11 at 11:56



                    Thank you for your interest in this question.
                    Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



                    Would you like to answer one of these unanswered questions instead?



                    Popular posts from this blog

                    Catalogne

                    Violoncelliste

                    Héron pourpré