Vader Sentiment with multiple PDF
I recently merged 20 PDFs into a single PDF with Adobe, and I imported that PDF into Python with this code:
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_file = open('/Users/cj/Desktop/PEI.pdf', 'rb')
newfile = open('rjtjj.txt', 'w')
pdf_reader = PdfFileReader(pdf_file)
pdf_writer = PdfFileWriter()
print(pdf_reader.numPages)
n = pdf_reader.getNumPages()
for i in range(0, n-1):
    # pdf_writer.addPage(pdf_reader.getPage(i))
    gft = pdf_reader.getPage(i)
    newfile.write(gft.extractText())
pdf_file.close()
newfile.close()
I'm trying to use vaderSentiment to analyse the PDF. What I want to do is analyse each of the 20 PDFs that were merged into one individually.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
with open('rjtjj.txt', 'r') as f:
    for line in f.read().split("\n"):
        vs = analyzer.polarity_scores(line)
I know my code is wrong, because it only gives me the first line of the entire PDF. I am new to this and would really appreciate your help.
Thank you
python-3.x
asked Nov 22 at 13:55
user10277070
1 Answer
Your problem really isn't about VADER sentiment analysis; it is about correctly extracting text from a PDF.
PDF descends from PostScript, a Turing-complete, Forth-like language, and it stores text as positioned glyphs rather than as sentences, so some PDF documents are hard to parse. You didn't post your PDF, so we can only guess at the issue. You might try poppler's pdftotext command-line utility instead. Ubuntu calls the package "poppler-utils"; on a Mac you would use brew install poppler. Running the file through pdf2ps and then ps2ascii will sometimes give different, and helpful, results.
If you continue to find it difficult to retrieve proper text from the PDF, you may want to contact whoever produced it and agree on receiving the same information in a different format.
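For what it's worth, pdftotext can be driven from Python through the standard subprocess module. The following is only a minimal sketch, assuming poppler is installed and pdftotext is on the PATH; the helper names build_pdftotext_command and pdf_to_text are made up for illustration, and -layout is an optional pdftotext flag that tries to preserve the page layout.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for poppler's pdftotext (hypothetical helper)."""
    # -layout asks pdftotext to keep the original physical layout where it can.
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def pdf_to_text(pdf_path, txt_path=None):
    """Convert a PDF to a text file with pdftotext and return the text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler first")
    pdf_path = Path(pdf_path)
    # Default to the same name with a .txt suffix, e.g. PEI.pdf -> PEI.txt.
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Calling pdf_to_text('/Users/cj/Desktop/PEI.pdf') would then write PEI.txt next to the PDF and return its contents. Whether the extracted text is any cleaner than what PyPDF2 gives depends entirely on the PDF.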
answered Nov 22 at 22:21
J_H
I have installed poppler with Homebrew. What code should I use? Can I use it from Python?
– user10277070
Nov 22 at 23:58
I was suggesting two things: (1) the PDF format can be "really hard" to parse ASCII text from, and (2) different PDF parsers come at it from different directions, so one might win in a certain situation, such as table-formatted pages, where another parser happens to lose. You didn't disclose the PDF of interest, nor how it was produced, nor how it might be produced through alternate means, including output to .TXT or to .CSV. If
$ pdftotext PEI.pdf
wins, and that is a big "if", then Python could simply consume the resulting PEI.txt ASCII text, without needing a PDF library at all.
– J_H
Nov 23 at 0:40
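If the extracted text does turn out to be usable, scoring each of the 20 original documents separately could look roughly like the sketch below. Note that the loop in the question reassigns vs on every iteration without storing it, and range(0, n-1) stops one page early. The sketch assumes three things that are not in the question: that the text came from pdftotext, which ends each page with a form feed character; that you know how many pages each original PDF had (the page_counts list is hypothetical); and that averaging VADER's per-line compound score is an acceptable summary for a whole document. All function names are made up for illustration.

```python
def split_pages(text):
    """Split pdftotext output into pages; each page ends with a form feed."""
    pages = text.split("\f")
    # The final form feed leaves an empty trailing chunk; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def group_pages(pages, page_counts):
    """Rejoin consecutive pages into documents, given each document's page count."""
    if sum(page_counts) != len(pages):
        raise ValueError("page_counts must add up to the number of pages")
    docs, start = [], 0
    for count in page_counts:
        docs.append("\n".join(pages[start:start + count]))
        start += count
    return docs


def mean_compound(doc, score):
    """Average the 'compound' score over the non-empty lines of one document."""
    lines = [line for line in doc.split("\n") if line.strip()]
    if not lines:
        return 0.0
    return sum(score(line)["compound"] for line in lines) / len(lines)


def make_vader_scorer():
    """Build the real scorer; imported lazily so the helpers work without VADER."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer().polarity_scores
```

With the text of PEI.txt in a variable text, something like [mean_compound(d, make_vader_scorer()) for d in group_pages(split_pages(text), page_counts)] would give one number per original PDF. VADER was built for short, sentence-sized inputs, so splitting on sentences rather than lines may work better if the line breaks in the extracted text fall mid-sentence.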