Vader Sentiment with multiple PDFs























I recently merged 20 PDFs into one PDF using Adobe, and I imported the merged PDF into Python with this code:



from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_file = open('/Users/cj/Desktop/PEI.pdf', 'rb')
newfile = open('rjtjj.txt', 'w')
pdf_reader = PdfFileReader(pdf_file)
pdf_writer = PdfFileWriter()
print(pdf_reader.numPages)
n = pdf_reader.getNumPages()
for i in range(0, n-1):
    # pdf_writer.addPage(pdf_reader.getPage(i))
    gft = pdf_reader.getPage(i)
    newfile.write(gft.extractText())
pdf_file.close()
newfile.close()


I'm trying to use vaderSentiment to analyse the PDF. What I want to do is analyse each of the 20 PDFs that were merged into one individually.



from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
with open('rjtjj.txt', 'r') as f:
    for line in f.read().split("\n"):
        vs = analyzer.polarity_scores(line)


I know my code is wrong, because it only gives me the first line of the entire PDF. I am new to this, and I would really appreciate your help.
Thank you.










      python-3.x
















      asked Nov 22 at 13:55









      user10277070

























          1 Answer






























          Your problem really isn't about Vader sentiment analysis -- it is about correct extraction of text from a PDF.



          PostScript is a Turing-complete, Forth-like language, and PDF grew out of it, so some PDF documents are "hard" to parse. You didn't post your PDF, so we can only guess at the issue. You might try poppler's pdftotext command-line utility instead. Ubuntu calls the package "poppler-utils"; on a Mac you would use brew install poppler. Running the file through pdf2ps and ps2ascii will sometimes give different, and helpful, results.



          If you continue to find it difficult to extract proper text from the PDF, you may want to contact whoever produced it and ask them to supply the same information in a different format.
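As a rough illustration of the pdftotext suggestion, the sketch below calls the utility from Python and splits its output into pages. It assumes poppler is installed and `pdftotext` is on the PATH. The function names `pdf_to_text` and `split_pages` are made up for this example. By default pdftotext puts a form feed character at each page break, and that is what the split relies on. Whether the extracted text is any good still depends on the PDF.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Run poppler's pdftotext and return the extracted text as one string."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout


def split_pages(text):
    """Split pdftotext output into pages on the form feed at each page break."""
    pages = text.split("\f")
    # The form feed after the last page leaves one empty trailing chunk.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages
```

Something like `pages = split_pages(pdf_to_text("PEI.pdf"))` would then give one string per page. Blank pages are kept on purpose so that list positions still match page numbers.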



























          • I have installed poppler with Homebrew. What code should I use? Can I use it from Python?
            – user10277070
            Nov 22 at 23:58










          • I was suggesting two things: (1) it can be "really hard" to extract plain text from the PDF format, and (2) different PDF parsers approach the problem from different directions, so one may succeed in a certain situation, such as table-formatted pages, where another fails. You didn't disclose the PDF of interest, how it was produced, or whether it could be produced by other means, including output to .TXT or .CSV. If $ pdftotext PEI.pdf works, and that is a big "if", then Python could simply read the resulting PEI.txt text file, without needing a PDF library at all.
            – J_H
            Nov 23 at 0:40
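If the extracted text is usable, scoring each of the 20 original documents separately might look like the sketch below. It assumes the page count of each original PDF is known. The list `page_counts` and the helper names are hypothetical, since nothing in the question says where each document starts in the merged file. It also collects every score in a list, whereas the loop in the question overwrites `vs` on each iteration and so keeps only one result.

```python
def group_pages(pages, page_counts):
    """Join consecutive pages into one text per original document."""
    documents = []
    start = 0
    for count in page_counts:
        documents.append("\n".join(pages[start:start + count]))
        start += count
    return documents


def score_documents(documents, score):
    """Apply a scoring function to each document and keep every result."""
    return [score(text) for text in documents]


def vader_scorer():
    """Build a VADER scoring function (requires the vaderSentiment package)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return analyzer.polarity_scores
```

With `pages` obtained by reading PEI.txt and splitting on `"\f"`, a call such as `score_documents(group_pages(pages, page_counts), vader_scorer())` would return one score dictionary per document. VADER was designed for short, social-media-style text, so scoring sentence by sentence and averaging may be more meaningful than scoring a whole document at once.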






































          answered Nov 22 at 22:21









          J_H







