python requests enable cookies/javascript











I am trying to download an Excel file from a specific website. On my local computer it works perfectly:



>>> r = requests.get('http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls')
>>> r.status_code
200
>>> r.content
b'\xd0\xcf\x11\xe0\xa1\xb1...\x00\x00' # Long binary string


But when I connect to a remote Ubuntu server and run the same request, I get a page telling me to enable cookies/JavaScript.



>>> r = requests.get('http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls')
>>> r.status_code
200
>>> r.content
b'<HTML>\n<head>\n<script>\nChallenge=141020;\nChallengeId=120854618;\nGenericErrorMessageCookies="Cookies must be enabled in order to view this page.";\n</script>\n<script>\nfunction test(var1)\n{\n\tvar var_str=""+Challenge;\n\tvar var_arr=var_str.split("");\n\tvar LastDig=var_arr.reverse()[0];\n\tvar minDig=var_arr.sort()[0];\n\tvar subvar1 = (2 * (var_arr[2]))+(var_arr[1]*1);\n\tvar subvar2 = (2 * var_arr[2])+var_arr[1];\n\tvar my_pow=Math.pow(((var_arr[0]*1)+2),var_arr[1]);\n\tvar x=(var1*3+subvar1)*1;\n\tvar y=Math.cos(Math.PI*subvar2);\n\tvar answer=x*y;\n\tanswer-=my_pow*1;\n\tanswer+=(minDig*1)-(LastDig*1);\n\tanswer=answer+subvar2;\n\treturn answer;\n}\n</script>\n<script>\nclient = null;\nif (window.XMLHttpRequest)\n{\n\tvar client=new XMLHttpRequest();\n}\nelse\n{\n\tif (window.ActiveXObject)\n\t{\n\t\tclient = new ActiveXObject(\'MSXML2.XMLHTTP.3.0\');\n\t};\n}\nif (!((!!client)&&(!!Math.pow)&&(!!Math.cos)&&(!![].sort)&&(!![].reverse)))\n{\n\tdocument.write("Not all needed JavaScript methods are supported.<BR>");\n\n}\nelse\n{\n\tclient.onreadystatechange = function()\n\t{\n\t\tif(client.readyState == 4)\n\t\t{\n\t\t\tvar MyCookie=client.getResponseHeader("X-AA-Cookie-Value");\n\t\t\tif ((MyCookie == null) || (MyCookie==""))\n\t\t\t{\n\t\t\t\tdocument.write(client.responseText);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\t\n\t\t\tvar cookieName = MyCookie.split(\'=\')[0];\n\t\t\tif (document.cookie.indexOf(cookieName)==-1)\n\t\t\t{\n\t\t\t\tdocument.write(GenericErrorMessageCookies);\n\t\t\t\treturn;\n\t\t\t}\n\t\t\twindow.location.reload(true);\n\t\t}\n\t};\n\ty=test(Challenge);\n\tclient.open("POST",window.location,true);\n\tclient.setRequestHeader(\'X-AA-Challenge-ID\', ChallengeId);\n\tclient.setRequestHeader(\'X-AA-Challenge-Result\',y);\n\tclient.setRequestHeader(\'X-AA-Challenge\',Challenge);\n\tclient.setRequestHeader(\'Content-Type\' , \'text/plain\');\n\tclient.send();\n}\n</script>\n</head>\n<body>\n<noscript>JavaScript must be enabled in order to view this page.</noscript>\n</body>\n</HTML>'


Locally I run macOS with Chrome installed (I'm not actively using it for the script, but maybe it's related?); the remote machine runs Ubuntu on DigitalOcean without any GUI browser installed.










      python cookies browser python-requests






      asked Nov 22 at 15:56









      DeanLa

          1 Answer
          accepted










          The behavior of requests has nothing to do with what browsers are installed on the system; it does not depend on or interact with them in any way.



          The problem here is that the resource you are requesting has some kind of "bot mitigation" mechanism enabled to prevent just this kind of access. It returns some JavaScript with logic that needs to be evaluated, and the result of that logic is then sent in the headers of an additional request to "prove" you're not a bot.
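As a rough illustration of the first step, the two values the page's script works with can be pulled out of the HTML with regular expressions. This is only a sketch: `extract_challenge` and the two pattern names are hypothetical helpers, and it assumes the page keeps the `Challenge=...;` and `ChallengeId=...;` assignments in the form shown in the question.

```python
import re

# Hypothetical helper, not part of the answer's code: find the two challenge
# values by pattern instead of by their position in the page.
CHALLENGE_RE = re.compile(r"\bChallenge=(\d+);")
CHALLENGE_ID_RE = re.compile(r"\bChallengeId=(\d+);")


def extract_challenge(page):
    """Return (challenge, challenge_id) as strings, or None if either is missing."""
    challenge = CHALLENGE_RE.search(page)
    challenge_id = CHALLENGE_ID_RE.search(page)
    if challenge is None or challenge_id is None:
        return None
    return challenge.group(1), challenge_id.group(1)


# Trimmed-down stand-in for the challenge page from the question.
sample = "<script>\nChallenge=141020;\nChallengeId=120854618;\n</script>"
print(extract_challenge(sample))  # ('141020', '120854618')
```

Returning `None` when a value is missing gives the caller a simple way to tell an ordinary response from a challenge page.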



          Luckily, it appears that this specific mitigation mechanism has been solved before, and I was able to quickly get this request working using the challenge-solving functions from that code (credited in the docstring below):



          from math import cos, pi, floor

          import requests

          URL = 'http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls'


          def parse_challenge(page):
              """
              Parse a challenge given by mmi and mavat's web servers, forcing us to solve
              some math stuff and send the result as a header to actually get the page.
              This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb
              """
              # The first <script> block holds "Challenge=...;" and "ChallengeId=...;",
              # one per line.
              top = page.split('<script>')[1].split('\n')
              challenge = top[1].split(';')[0].split('=')[1]
              challenge_id = top[2].split(';')[0].split('=')[1]
              return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)}


          def get_challenge_answer(challenge):
              """
              Solve the math part of the challenge and get the result
              """
              arr = list(challenge)
              # Take the last digit before sorting; the page's script reads it
              # from the reversed array.
              last_digit = int(arr[-1])
              arr.sort()
              min_digit = int(arr[0])
              subvar1 = (2 * int(arr[2])) + int(arr[1])
              # String concatenation, as in the page's script.
              subvar2 = str(2 * int(arr[2])) + arr[1]
              power = ((int(arr[0]) * 1) + 2) ** int(arr[1])
              x = (int(challenge) * 3 + subvar1)
              # The page's script uses subvar2 here; subvar1 has the same parity,
              # so cos(pi * n) comes out the same.
              y = cos(pi * subvar1)
              answer = x * y
              answer -= power
              answer += (min_digit - last_digit)
              answer = str(int(floor(answer))) + subvar2
              return answer


          def main():
              s = requests.Session()
              r = s.get(URL)

              # Only solve the challenge if the server actually sent one.
              if 'X-AA-Challenge' in r.text:
                  challenge = parse_challenge(r.text)
                  r = s.get(URL, headers={
                      'X-AA-Challenge': challenge['challenge'],
                      'X-AA-Challenge-ID': challenge['challenge_id'],
                      'X-AA-Challenge-Result': challenge['challenge_result']
                  })

                  # Repeat the request with the cookie issued for the solved challenge.
                  yum = r.cookies
                  r = s.get(URL, cookies=yum)

              print(r.content)


          if __name__ == '__main__':
              main()
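To make the arithmetic concrete, here is a small standalone sketch that follows the `test()` function from the challenge page in the question and evaluates it for the `Challenge=141020` value shown there. The name `solve_challenge` is hypothetical, and using `round()` to absorb floating-point noise from `cos` is an assumption of this sketch rather than something the page's script does.

```python
from math import cos, pi


def solve_challenge(challenge):
    """Hypothetical port of the page's test() function; returns a string."""
    digits = list(challenge)
    last_digit = int(digits[-1])   # the script reads this as reverse()[0]
    digits.sort()                  # the script sorts the same array in place
    min_digit = int(digits[0])
    subvar1 = 2 * int(digits[2]) + int(digits[1])
    subvar2 = str(2 * int(digits[2])) + digits[1]  # string concatenation
    power = (int(digits[0]) + 2) ** int(digits[1])
    x = int(challenge) * 3 + subvar1
    y = cos(pi * int(subvar2))     # +1 or -1 up to rounding error
    answer = x * y - power + (min_digit - last_digit)
    # Assumption: round() removes any floating-point noise left by cos().
    return str(round(answer)) + subvar2


print(solve_challenge("141020"))  # 42306120
```

For `141020` the sorted digits are 0, 0, 1, 1, 2, 4, which gives `subvar1 = 2`, `subvar2 = "20"`, `power = 1`, `x = 423062` and `y = 1`. The numeric part is therefore 423061, and appending `subvar2` yields `42306120`.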





          • I'm so curious about this question. Can you explain why the request succeeds on the local machine but fails on the remote one with the same code?
            – kcorlidy
            Nov 23 at 5:59










          • @kcorlidy That is a great question, but I'm really not sure. I suspect that the mitigation mechanism might not kick in on every request, but only on ones it deems "suspicious". It may be based on factors like the requester's IP address, whether that IP address has successfully accessed other resources recently, etc. You see that kind of behavior a lot with things like captchas.
            – cody
            Nov 23 at 11:25










          • Thanks @cody this seems like an elegant solution, but for some reason, it doesn't work. When I debug, after the get with the headers, I try r.cookies and I receive an empty cookie jar <RequestsCookieJar>.
            – DeanLa
            Nov 23 at 14:21










          • @DeanLa Hmm.. did you try running the code I provided verbatim? I get the full binary content of the XLS file as output. If I have it print the value of yum, the cookie, I get something like <RequestsCookieJar[<Cookie BotMitigationCookie_1....for www.health.gov.il/>]>
            – cody
            Nov 23 at 15:10










          • My bad, I hadn't considered non-existent files. They no longer return a 404 as they did before; they return an empty cookie jar.
            – DeanLa
            Nov 23 at 17:30
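Since the status code is 200 even for the challenge page, one way to avoid saving a challenge page as a spreadsheet is to inspect the response body itself. The sketch below is a hypothetical helper, not part of the answer. It assumes the file is a legacy `.xls` workbook, which is stored as an OLE2 compound file and so begins with a fixed 8-byte signature, and that a challenge page still contains the `X-AA-Challenge` header name seen in the question.

```python
# Signature that opens every OLE2 compound file, the container format
# used by legacy .xls workbooks.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def classify_response(content):
    """Classify a response body as 'xls', 'challenge', or 'unknown'.

    Hypothetical helper; the 'challenge' check assumes the page still
    contains the X-AA-Challenge header name seen in the question.
    """
    if content.startswith(OLE2_MAGIC):
        return "xls"
    if b"X-AA-Challenge" in content:
        return "challenge"
    return "unknown"
```

A caller could write `r.content` to disk only when `classify_response(r.content)` returns `'xls'`, and treat `'unknown'` as a probable missing file.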











          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes








          up vote
          1
          down vote



          accepted










          The behavior of requests has nothing to do with what browsers are installed on the system, it does not depend on or interact with them in any way.



          The problem here is that the resource you are requesting has some kind of "bot mitigation" mechanism enabled to prevent just this kind of access. It returns some javascript with logic that needs to be evaluated, and the results of that logic are then used for an additional request to "prove" you're not a bot.



          Luckily, it appears that this specific mitigation mechanism has been solved before, and I was able to quickly get this request working utilizing the challenge-solving functions from that code:



          from math import cos, pi, floor

          import requests

          URL = 'http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls'


          def parse_challenge(page):
          """
          Parse a challenge given by mmi and mavat's web servers, forcing us to solve
          some math stuff and send the result as a header to actually get the page.
          This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb
          """
          top = page.split('<script>')[1].split('n')
          challenge = top[1].split(';')[0].split('=')[1]
          challenge_id = top[2].split(';')[0].split('=')[1]
          return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)}


          def get_challenge_answer(challenge):
          """
          Solve the math part of the challenge and get the result
          """
          arr = list(challenge)
          last_digit = int(arr[-1])
          arr.sort()
          min_digit = int(arr[0])
          subvar1 = (2 * int(arr[2])) + int(arr[1])
          subvar2 = str(2 * int(arr[2])) + arr[1]
          power = ((int(arr[0]) * 1) + 2) ** int(arr[1])
          x = (int(challenge) * 3 + subvar1)
          y = cos(pi * subvar1)
          answer = x * y
          answer -= power
          answer += (min_digit - last_digit)
          answer = str(int(floor(answer))) + subvar2
          return answer


          def main():
          s = requests.Session()
          r = s.get(URL)

          if 'X-AA-Challenge' in r.text:
          challenge = parse_challenge(r.text)
          r = s.get(URL, headers={
          'X-AA-Challenge': challenge['challenge'],
          'X-AA-Challenge-ID': challenge['challenge_id'],
          'X-AA-Challenge-Result': challenge['challenge_result']
          })

          yum = r.cookies
          r = s.get(URL, cookies=yum)

          print(r.content)


          if __name__ == '__main__':
          main()





          share|improve this answer























          • im so curious about this question, and can you explain why he successfully request on local machine but fail on remote one with same code?
            – kcorlidy
            Nov 23 at 5:59










          • @kcorlidy That is a great question, but I'm really not sure. I suspect that the mitigation mechanism might not kick in on every request, but only ones it deems "suspicious". It may be based on factors like the requestor's ip address, whether that ip address has successfully accessed other resources recently, etc. You see that kind of behavior a lot with things like captchas.
            – cody
            Nov 23 at 11:25










          • Thanks @cody this seems like an elegant solution, but for some reason, it doesn't work. When I debug, after the get with the headers, I try r.cookies and I receive an empty cookie jar <RequestsCookieJar>.
            – DeanLa
            Nov 23 at 14:21










          • @DeanLa Hmm.. did you try running the code I provided verbatim? I get the full binary content of the XLS file as output. If I have it print the value of yum, the cookie, I get something like <RequestsCookieJar[<Cookie BotMitigationCookie_1....for www.health.gov.il/>]>
            – cody
            Nov 23 at 15:10










          • My bad, I wasn't thinking of non-existing files. They don't return 404 as previously, they return empty cookie jars.
            – DeanLa
            Nov 23 at 17:30















          up vote
          1
          down vote



          accepted










          The behavior of requests has nothing to do with what browsers are installed on the system, it does not depend on or interact with them in any way.



          The problem here is that the resource you are requesting has some kind of "bot mitigation" mechanism enabled to prevent just this kind of access. It returns some javascript with logic that needs to be evaluated, and the results of that logic are then used for an additional request to "prove" you're not a bot.



          Luckily, it appears that this specific mitigation mechanism has been solved before, and I was able to quickly get this request working utilizing the challenge-solving functions from that code:



          from math import cos, pi, floor

          import requests

          URL = 'http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls'


          def parse_challenge(page):
          """
          Parse a challenge given by mmi and mavat's web servers, forcing us to solve
          some math stuff and send the result as a header to actually get the page.
          This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb
          """
          top = page.split('<script>')[1].split('n')
          challenge = top[1].split(';')[0].split('=')[1]
          challenge_id = top[2].split(';')[0].split('=')[1]
          return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)}


          def get_challenge_answer(challenge):
          """
          Solve the math part of the challenge and get the result
          """
          arr = list(challenge)
          last_digit = int(arr[-1])
          arr.sort()
          min_digit = int(arr[0])
          subvar1 = (2 * int(arr[2])) + int(arr[1])
          subvar2 = str(2 * int(arr[2])) + arr[1]
          power = ((int(arr[0]) * 1) + 2) ** int(arr[1])
          x = (int(challenge) * 3 + subvar1)
          y = cos(pi * subvar1)
          answer = x * y
          answer -= power
          answer += (min_digit - last_digit)
          answer = str(int(floor(answer))) + subvar2
          return answer


          def main():
          s = requests.Session()
          r = s.get(URL)

          if 'X-AA-Challenge' in r.text:
          challenge = parse_challenge(r.text)
          r = s.get(URL, headers={
          'X-AA-Challenge': challenge['challenge'],
          'X-AA-Challenge-ID': challenge['challenge_id'],
          'X-AA-Challenge-Result': challenge['challenge_result']
          })

          yum = r.cookies
          r = s.get(URL, cookies=yum)

          print(r.content)


          if __name__ == '__main__':
          main()





          share|improve this answer























          • im so curious about this question, and can you explain why he successfully request on local machine but fail on remote one with same code?
            – kcorlidy
            Nov 23 at 5:59










          • @kcorlidy That is a great question, but I'm really not sure. I suspect that the mitigation mechanism might not kick in on every request, but only ones it deems "suspicious". It may be based on factors like the requestor's ip address, whether that ip address has successfully accessed other resources recently, etc. You see that kind of behavior a lot with things like captchas.
            – cody
            Nov 23 at 11:25










          • Thanks @cody this seems like an elegant solution, but for some reason, it doesn't work. When I debug, after the get with the headers, I try r.cookies and I receive an empty cookie jar <RequestsCookieJar>.
            – DeanLa
            Nov 23 at 14:21










          • @DeanLa Hmm.. did you try running the code I provided verbatim? I get the full binary content of the XLS file as output. If I have it print the value of yum, the cookie, I get something like <RequestsCookieJar[<Cookie BotMitigationCookie_1....for www.health.gov.il/>]>
            – cody
            Nov 23 at 15:10










          • My bad, I wasn't thinking of non-existing files. They don't return 404 as previously, they return empty cookie jars.
            – DeanLa
            Nov 23 at 17:30













          up vote
          1
          down vote



          accepted







          up vote
          1
          down vote



          accepted






          The behavior of requests has nothing to do with what browsers are installed on the system, it does not depend on or interact with them in any way.



          The problem here is that the resource you are requesting has some kind of "bot mitigation" mechanism enabled to prevent just this kind of access. It returns some javascript with logic that needs to be evaluated, and the results of that logic are then used for an additional request to "prove" you're not a bot.



          Luckily, it appears that this specific mitigation mechanism has been solved before, and I was able to quickly get this request working utilizing the challenge-solving functions from that code:



          from math import cos, pi, floor

          import requests

          URL = 'http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls'


          def parse_challenge(page):
          """
          Parse a challenge given by mmi and mavat's web servers, forcing us to solve
          some math stuff and send the result as a header to actually get the page.
          This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb
          """
          top = page.split('<script>')[1].split('n')
          challenge = top[1].split(';')[0].split('=')[1]
          challenge_id = top[2].split(';')[0].split('=')[1]
          return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)}


          def get_challenge_answer(challenge):
          """
          Solve the math part of the challenge and get the result
          """
          arr = list(challenge)
          last_digit = int(arr[-1])
          arr.sort()
          min_digit = int(arr[0])
          subvar1 = (2 * int(arr[2])) + int(arr[1])
          subvar2 = str(2 * int(arr[2])) + arr[1]
          power = ((int(arr[0]) * 1) + 2) ** int(arr[1])
          x = (int(challenge) * 3 + subvar1)
          y = cos(pi * subvar1)
          answer = x * y
          answer -= power
          answer += (min_digit - last_digit)
          answer = str(int(floor(answer))) + subvar2
          return answer


          def main():
          s = requests.Session()
          r = s.get(URL)

          if 'X-AA-Challenge' in r.text:
          challenge = parse_challenge(r.text)
          r = s.get(URL, headers={
          'X-AA-Challenge': challenge['challenge'],
          'X-AA-Challenge-ID': challenge['challenge_id'],
          'X-AA-Challenge-Result': challenge['challenge_result']
          })

          yum = r.cookies
          r = s.get(URL, cookies=yum)

          print(r.content)


          if __name__ == '__main__':
          main()





          share|improve this answer














          The behavior of requests has nothing to do with what browsers are installed on the system, it does not depend on or interact with them in any way.



          The problem here is that the resource you are requesting has some kind of "bot mitigation" mechanism enabled to prevent just this kind of access. It returns some javascript with logic that needs to be evaluated, and the results of that logic are then used for an additional request to "prove" you're not a bot.



          Luckily, it appears that this specific mitigation mechanism has been solved before, and I was able to quickly get this request working utilizing the challenge-solving functions from that code:



          from math import cos, pi, floor

          import requests

          URL = 'http://www.health.gov.il/PublicationsFiles/IWER01_2004.xls'


          def parse_challenge(page):
              """
              Parse a challenge given by mmi and mavat's web servers, forcing us to solve
              some math stuff and send the result as a header to actually get the page.
              This logic is pretty much copied from https://github.com/R3dy/jigsaw-rails/blob/master/lib/breakbot.rb
              """
              # The first <script> block declares Challenge and ChallengeId on separate lines.
              top = page.split('<script>')[1].split('\n')
              challenge = top[1].split(';')[0].split('=')[1]
              challenge_id = top[2].split(';')[0].split('=')[1]
              return {'challenge': challenge, 'challenge_id': challenge_id, 'challenge_result': get_challenge_answer(challenge)}


          def get_challenge_answer(challenge):
              """
              Solve the math part of the challenge and get the result
              """
              arr = list(challenge)
              last_digit = int(arr[-1])
              arr.sort()  # the JavaScript sorts in place, so later indexes use the sorted digits
              min_digit = int(arr[0])
              subvar1 = (2 * int(arr[2])) + int(arr[1])
              subvar2 = str(2 * int(arr[2])) + arr[1]  # string concatenation, as in the JavaScript
              power = ((int(arr[0]) * 1) + 2) ** int(arr[1])
              x = (int(challenge) * 3 + subvar1)
              # The JavaScript uses subvar2 here; subvar1 has the same parity, so cos() gives the same +1/-1.
              y = cos(pi * subvar1)
              answer = x * y
              answer -= power
              answer += (min_digit - last_digit)
              answer = str(int(floor(answer))) + subvar2
              return answer


          def main():
              s = requests.Session()
              r = s.get(URL)

              if 'X-AA-Challenge' in r.text:
                  challenge = parse_challenge(r.text)
                  # Send the solved challenge back; the response carries the mitigation cookie.
                  r = s.get(URL, headers={
                      'X-AA-Challenge': challenge['challenge'],
                      'X-AA-Challenge-ID': challenge['challenge_id'],
                      'X-AA-Challenge-Result': challenge['challenge_result']
                  })

                  # Request the file again, this time with the cookie.
                  yum = r.cookies
                  r = s.get(URL, cookies=yum)

              print(r.content)


          if __name__ == '__main__':
              main()
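Because the server answers 200 for both the challenge page and the real file, it may help to check the bytes before saving them. The sketch below assumes the target is a legacy .xls file, which is stored as an OLE2 compound file and starts with the signature `D0 CF 11 E0 A1 B1 1A E1` (the same leading bytes shown in the question's local output). `looks_like_xls` and `save_if_xls` are hypothetical helper names, and the sample byte strings are stand-ins, not real downloads.

```python
import os
import tempfile

# Signature that opens every OLE2 compound file, the container used by legacy .xls.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def looks_like_xls(content):
    """True if the bytes start with the OLE2 signature."""
    return content.startswith(OLE2_MAGIC)


def save_if_xls(content, path):
    """Write content to path only when it looks like a real .xls; report success."""
    if not looks_like_xls(content):
        return False
    with open(path, "wb") as fh:
        fh.write(content)
    return True


# Stand-in payloads: the start of a challenge page and a fake OLE2 header.
challenge_page = b"<HTML>\n<head>\n<script>\nChallenge=141020;"
fake_xls = OLE2_MAGIC + b"\x00" * 16

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "IWER01_2004.xls")
    print(save_if_xls(challenge_page, target))  # challenge page is rejected
    print(save_if_xls(fake_xls, target))        # OLE2 payload is written
```

In the answer's `main()`, this would replace `print(r.content)` with something like `save_if_xls(r.content, 'IWER01_2004.xls')`.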














          edited Nov 23 at 11:27

























          answered Nov 22 at 16:38









          cody













          • I'm curious about this question. Can you explain why the request succeeds on the local machine but fails on the remote one with the same code?
            – kcorlidy
            Nov 23 at 5:59










          • @kcorlidy That is a great question, but I'm really not sure. I suspect that the mitigation mechanism might not kick in on every request, but only on ones it deems "suspicious". It may be based on factors like the requester's IP address, whether that IP address has successfully accessed other resources recently, etc. You see that kind of behavior a lot with things like captchas.
            – cody
            Nov 23 at 11:25










          • Thanks @cody this seems like an elegant solution, but for some reason, it doesn't work. When I debug, after the get with the headers, I try r.cookies and I receive an empty cookie jar <RequestsCookieJar>.
            – DeanLa
            Nov 23 at 14:21










          • @DeanLa Hmm.. did you try running the code I provided verbatim? I get the full binary content of the XLS file as output. If I have it print the value of yum, the cookie, I get something like <RequestsCookieJar[<Cookie BotMitigationCookie_1....for www.health.gov.il/>]>
            – cody
            Nov 23 at 15:10










          • My bad, I wasn't thinking of non-existing files. They don't return 404 as previously, they return empty cookie jars.
            – DeanLa
            Nov 23 at 17:30
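Following the last two comments, one way to tell a solved challenge from a missing file is to look for the mitigation cookie before making the final request. This is only a sketch: it assumes the cookie name always starts with `BotMitigationCookie`, the prefix visible in the output quoted above, and `has_mitigation_cookie` is a hypothetical helper. The sample names in the demo are made up.

```python
def has_mitigation_cookie(cookie_names, prefix="BotMitigationCookie"):
    """True if any cookie name starts with the assumed mitigation prefix."""
    return any(name.startswith(prefix) for name in cookie_names)


# Made-up cookie names standing in for r.cookies.keys().
print(has_mitigation_cookie(["BotMitigationCookie_example"]))
print(has_mitigation_cookie([]))
```

With the answer's code, the names would come from `r.cookies.keys()` after the request that carries the X-AA-Challenge headers. An empty jar at that point suggests, per the comment above, that the file does not exist.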


























































