I am using impyla 0.9.0. If I specify the port in the connect call:
conn = impala.dbapi.connect(host='n1', port=21000)
I get the following error:
Traceback (most recent call last):
File "./myquery.py", line 78, in <module>
main(len(sys.argv), sys.argv)
File "./myquery.py", line 58, in main
cur = conn.cursor()
File "/usr/lib/python2.6/site-packages/impala/dbapi/hiveserver2.py", line 55, in cursor
rpc.open_session(self.service, user, configuration))
File "/usr/lib/python2.6/site-packages/impala/_rpc/hiveserver2.py", line 132, in wrapper
return func(*args, **kwargs)
File "/usr/lib/python2.6/site-packages/impala/_rpc/hiveserver2.py", line 214, in open_session
resp = service.OpenSession(req)
File "/usr/lib/python2.6/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 175, in OpenSession
return self.recv_OpenSession()
File "/usr/lib/python2.6/site-packages/impala/_thrift_gen/TCLIService/TCLIService.py", line 191, in recv_OpenSession
raise x
thrift.Thrift.TApplicationException: Invalid method name: 'OpenSession'
But it is a valid port.
impala-shell -i n1:21000
Starting Impala Shell without Kerberos authentication
Connected to n1:21000
Server version: impalad version 2.1.1-cdh5 RELEASE (build 7901877736e29716147c4804b0841afc4ebc9037)
Welcome to the Impala shell. Press TAB twice to see a list of available commands.
Copyright (c) 2012 Cloudera, Inc. All rights reserved.
(Shell build version: Impala Shell v2.1.1-cdh5 (7901877) built on Tue Jan 27 16:23:42 PST 2015)
[n1:21000] >
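One possible explanation, assuming the cluster uses the default Impala port layout: impala-shell of this vintage talks the Beeswax protocol on port 21000, while impyla speaks HiveServer2, which impalad serves on port 21050 by default. A Beeswax endpoint does not implement the HiveServer2 method OpenSession, which would match the "Invalid method name" error. The helper below is a hypothetical sketch (the function name and messages are illustrative and not part of impyla) that encodes this rule of thumb:

```python
# Default impalad client ports. This is an assumption: a cluster may override
# them with the impalad --beeswax_port and --hs2_port startup flags.
DEFAULT_IMPALA_PORTS = {
    21000: "beeswax",      # used by impala-shell in Impala 2.x
    21050: "hiveserver2",  # used by impyla and JDBC/ODBC clients
}


def impyla_port_hint(port):
    """Return a short hint about whether `port` is likely usable by impyla."""
    protocol = DEFAULT_IMPALA_PORTS.get(port)
    if protocol == "hiveserver2":
        return "ok: default HiveServer2 port"
    if protocol == "beeswax":
        return "mismatch: default Beeswax port, try 21050 for impyla"
    return "unknown: check the impalad --hs2_port setting"


print(impyla_port_hint(21000))
print(impyla_port_hint(21050))
```

If this is the cause, the fix in the question would be to call impala.dbapi.connect(host='n1', port=21050) instead of using port 21000.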
There seem to be different versions of thrift-sasl and impyla that work or don't work together, and it is not easy to figure out these version mismatches. So we finally abandoned impyla and went with pyodbc and the Cloudera Impala ODBC driver, which is easier to get working and has worked well so far. Check out this link: https://plenium.wordpress.com/2020/05/04/use-pyodbc-with-cloudera-impala-odbc-and-kerberos/
I have been getting the same error when trying to connect to an Impala instance on a kerberized cluster! Any particular reason why we get this?
@JasonBourne - if you have the same issue, here's a GitHub issue discussing it and linking to a pull request to fix it: https://github.com/cloudera/thrift_sasl/issues/28 You can see in the commits (here: https://github.com/cloudera/thrift_sasl/commits/master) that they are testing a new release for a fix, but it looks like it's not quite done yet. Hopefully soon.
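Since the comments above blame version mismatches between impyla and thrift-sasl, a first debugging step could be to print which versions are actually installed. This is a minimal sketch using only the standard library (Python 3.8+); the list of distribution names is an assumption about what typically sits next to impyla, so adjust it to your environment:

```python
from importlib import metadata

# Distribution names are assumptions about what is installed alongside impyla.
PACKAGES = ("impyla", "thrift", "thrift-sasl", "thriftpy2", "pure-sasl", "sasl")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Including this output in a bug report makes it easier to compare against the combinations discussed in the linked thrift_sasl issue.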
Tried:
from impala.dbapi import connect
conn = connect(host='my.impala.host', port=21050)
cursor = conn.cursor()
cursor.execute('SELECT * FROM youval_db.accounts_info LIMIT 10')
print(cursor.description)  # prints the result set's schema
results = cursor.fetchall()
Also tried with
conn = connect()
---------------------------------------------------------------------------
HiveServer2Error Traceback (most recent call last)
<ipython-input-13-82112a6ffca2> in <module>()
2 conn = connect(host='myhost', port=21050)
3
----> 4 cursor = conn.cursor()
5 cursor.execute('SELECT * FROM default.testtable')
6 print (cursor.description) # prints the result set's schema
/data/opt/anaconda3/lib/python3.7/site-packages/impala/hiveserver2.py in cursor(self, user, configuration, convert_types, dictify, fetch_error)
122 log.debug('.cursor(): getting new session_handle')
123
--> 124 session = self.service.open_session(user, configuration)
125
126 log.debug('HiveServer2Cursor(service=%s, session_handle=%s, '
/data/opt/anaconda3/lib/python3.7/site-packages/impala/hiveserver2.py in open_session(self, user, configuration)
1062 username=user,
1063 configuration=configuration)
-> 1064 resp = self._rpc('OpenSession', req)
1065 return HS2Session(self, resp.sessionHandle,
1066 resp.configuration,
/data/opt/anaconda3/lib/python3.7/site-packages/impala/hiveserver2.py in _rpc(self, func_name, request)
990 def _rpc(self, func_name, request):
991 self._log_request(func_name, request)
--> 992 response = self._execute(func_name, request)
993 self._log_response(func_name, response)
994 err_if_rpc_not_ok(response)
/data/opt/anaconda3/lib/python3.7/site-packages/impala/hiveserver2.py in _execute(self, func_name, request)
1021
1022 raise HiveServer2Error('Failed after retrying {0} times'
-> 1023 .format(self.retries))
1024
1025 def _operation(self, kind, request):
HiveServer2Error: Failed after retrying 3 times
/data/opt/anaconda3/lib/python3.7/site-packages/thrift_sasl/__init__.py in open(self)
65
66 def open(self):
---> 67 if not self._trans.isOpen():
68 self._trans.open()
69
AttributeError: 'TSocket' object has no attribute 'isOpen'
The hang seems to be in the statement buff = self.sock.recv(sz)
/data/opt/anaconda3/lib/python3.7/site-packages/thriftpy2/transport/socket.py in read(self, sz)
107 while True:
108 try:
--> 109 buff = self.sock.recv(sz)
110 except socket.error as e:
111 if e.errno == errno.EINTR:
KeyboardInterrupt:
After trying various options and setting timeout=100 in the connect statement, the script appears to query the Impala table successfully, but every 2nd or 3rd run it fails with the error below:
/data/opt/anaconda3/lib/python3.7/site-packages/impala/hiveserver2.py in _rpc(self, func_name, request)
992 response = self._execute(func_name, request)
993 self._log_response(func_name, response)
--> 994 err_if_rpc_not_ok(response)
995 return response
996
/data/opt/anaconda3/lib/python3.7/site-packages/impala/hiveserver2.py in err_if_rpc_not_ok(resp)
746 resp.status.statusCode != TStatusCode.SUCCESS_WITH_INFO_STATUS and
747 resp.status.statusCode != TStatusCode.STILL_EXECUTING_STATUS):
--> 748 raise HiveServer2Error(resp.status.errorMessage)
749
750
HiveServer2Error: Invalid query handle: b14cce8e19xxxx:5b51463xxxx
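For an intermittent failure like the one above, one workaround to try (not a confirmed fix for the underlying cause) is to retry the whole query with a fresh connection and cursor on each attempt. The sketch below is generic: run_with_retry is a hypothetical helper, and the impyla usage is shown only in comments because it needs a reachable impalad.

```python
import time


def run_with_retry(run_query, attempts=3, delay_seconds=1.0, retry_on=(Exception,)):
    """Call run_query() until it succeeds or the attempts are exhausted.

    run_query should open its own connection and cursor, so that every
    attempt starts from a fresh session instead of a stale query handle.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_query()
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Hypothetical usage with impyla (host, port and table taken from the post above):
#
# from impala.dbapi import connect
# from impala.error import HiveServer2Error
#
# def query():
#     conn = connect(host='myhost', port=21050, timeout=100)
#     try:
#         cursor = conn.cursor()
#         cursor.execute('SELECT * FROM default.testtable')
#         return cursor.fetchall()
#     finally:
#         conn.close()
#
# rows = run_with_retry(query, retry_on=(HiveServer2Error,))
```

Restricting retry_on to HiveServer2Error keeps genuine programming errors from being silently retried.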
These issues can cause incorrect or unexpected results from queries. They typically only arise in very specific circumstances.
Using a CAST() function to convert large literal values to smaller types, or to convert special values such as NaN or Inf, produces values not consistent with other database systems. This could lead to unexpected results from queries.
These issues affect the ability to interchange data between Impala and other database systems. They cover areas such as data types and file formats.
These issues can prevent one or more Impala-related daemons from starting properly.
Impala could encounter a serious error due to resource usage under very high concurrency. The error message is similar to:
F0629 08:20:02.956413 29088 llvm-codegen.cc:111] LLVM hit fatal error: Unable to allocate section memory!
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::thread_resource_error> >'
Workaround: To prevent such errors, configure each host running an impalad daemon with the following settings:
echo 2000000 > /proc/sys/kernel/threads-max
echo 2000000 > /proc/sys/kernel/pid_max
echo 8000000 > /proc/sys/vm/max_map_count
Add the following lines in /etc/security/limits.conf:
impala soft nproc 262144
impala hard nproc 262144
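To verify whether a host already meets the kernel settings from the workaround above, a small read-only check could look like this. It is a sketch: the helper names are hypothetical, the thresholds are copied from the workaround, and the nproc limits in /etc/security/limits.conf are not checked.

```python
# Recommended minimums taken from the workaround above.
RECOMMENDED = {
    "/proc/sys/kernel/threads-max": 2000000,
    "/proc/sys/kernel/pid_max": 2000000,
    "/proc/sys/vm/max_map_count": 8000000,
}


def read_int(path):
    """Read a single integer value from a /proc/sys file."""
    with open(path) as handle:
        return int(handle.read().strip())


def find_low_limits(read=read_int, recommended=RECOMMENDED):
    """Return {path: (current, wanted)} for settings below the recommendation."""
    low = {}
    for path, wanted in recommended.items():
        try:
            current = read(path)
        except (OSError, ValueError):
            current = None  # unreadable, for example when not running on Linux
        if current is None or current < wanted:
            low[path] = (current, wanted)
    return low


if __name__ == "__main__":
    for path, (current, wanted) in find_low_limits().items():
        print(f"{path}: current={current}, recommended>={wanted}")
```

An empty result means all three values are at or above the recommended minimums on that host.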
An OUTER JOIN query could omit some expected result rows due to a constant such as FALSE in another join clause. For example:
explain SELECT 1 FROM alltypestiny a1
  INNER JOIN alltypesagg a2 ON a1.smallint_col = a2.year AND false
  RIGHT JOIN alltypes a3 ON a1.year = a1.bigint_col;
+---------------------------------------------------------+
| Explain String                                          |
+---------------------------------------------------------+
| Estimated Per-Host Requirements: Memory=1.00KB VCores=1 |
|                                                         |
| 00:EMPTYSET                                             |
+---------------------------------------------------------+
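To make the expected semantics concrete: a RIGHT JOIN should keep every row of its right-hand table (alltypes here) even when the inner join on its left side produces nothing, so an EMPTYSET plan loses rows. The sketch below reproduces that expectation with Python's built-in sqlite3 module rather than Impala. The tables are tiny made-up stand-ins, and the RIGHT JOIN is rewritten as an equivalent LEFT JOIN over a subquery because older SQLite versions lack RIGHT JOIN.

```python
import sqlite3


def expected_outer_join_rows(a3_rows=3):
    """Count rows the query should return when alltypes has a3_rows rows."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE alltypestiny (smallint_col INTEGER, year INTEGER, bigint_col INTEGER);
        CREATE TABLE alltypesagg (year INTEGER);
        CREATE TABLE alltypes (id INTEGER);
        INSERT INTO alltypestiny VALUES (1, 2009, 10);
        INSERT INTO alltypesagg VALUES (2010);
    """)
    con.executemany("INSERT INTO alltypes VALUES (?)", [(i,) for i in range(a3_rows)])
    # "AND 0" plays the role of "AND false" in the original query.
    rows = con.execute("""
        SELECT 1
        FROM alltypes a3
        LEFT JOIN (
            SELECT a1.year AS year, a1.bigint_col AS bigint_col
            FROM alltypestiny a1
            INNER JOIN alltypesagg a2 ON a1.smallint_col = a2.year AND 0
        ) AS j ON j.year = j.bigint_col
    """).fetchall()
    con.close()
    return len(rows)


print(expected_outer_join_rows())
```

The count equals the number of rows in the stand-in alltypes table, which is what a correct plan for the Impala query should also preserve.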
In Impala 3.2 and higher, if the following error appears multiple times in a short duration while running a query, it means that the connection between the impalad and the HDFS NameNode is in a bad state, and the impalad has to be restarted:
"hdfsOpenFile() for <filename> at backend <hostname:port> failed to finish before the <hdfs_operation_timeout_sec> second timeout"
ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL' = 'TRUE');
A table and a database that share the same name can cause a query failure if the table is not readable by Impala, for example, because the table was created in Hive in the Open CSV Serde format. The following exception is returned:
CAUSED BY: TableLoadingException: Unrecognized table type for table
I was able to connect to HiveServer2 via a Java client, so it seems that the connectivity issue is Python/impyla specific. When I debug and step into the code, it hangs at line 873 of hiveserver2.py.
Exactly a year later, still getting this issue. It seems to hang consistently with certain queries (which only return ~200 rows tops) that are near-instant using database query tools such as DBeaver. Other queries work fine, even when they are more complicated and return more records.
It would be nice to understand this issue if you do figure it out. Unfortunately I don't have the bandwidth to help debug it further. If you are able to sort it out (and if it is an impyla bug), please let me know the resolution here. cc @mjacobs
When I try to run the following code, the client hangs when trying to connect to Hive:
from impala.dbapi import connect
conn = connect(host='host_running_hs2_service', port=10000, user='awoolford', password='Bzzzzz')
cursor = conn.cursor()  # <- hangs here
cursor.execute('show tables')
results = cursor.fetchall()
print(results)
Hang occurs @ TOpenSessionReq
Attempting to open transport (tries_left=2)
Transport opened
Establishing Connection
Connecting to HiveServer2 hostname:25003 with PLAIN authentication mechanism
get_socket: host=hostname port=25003 use_ssl=False ca_cert=None
sock=<thrift.transport.TSocket.TSocket instance at 0x7f765fea0aa0>
get_transport: socket=<thrift.transport.TSocket.TSocket instance at 0x7f765fea0aa0> host=hostname kerberos_service_name=impala auth_mechanism=PLAIN user=user password=fuggetaboutit
transport=<thrift_sasl.TSaslClientTransport instance at 0x7f765fea0e60> protocol=<thrift.protocol.TBinaryProtocol.TBinaryProtocolAccelerated instance at 0x7f765fea7140> service=<impala._thrift_gen.ImpalaService.ImpalaHiveServer2Service.Client object at 0x7f765fe9dd50>
HiveServer2Connection(service=<impala.hiveserver2.HS2Service object at 0x7f765fe9dd90>, default_db=co5012_cpi_int)
Connection Established
Acquiring Cursor
Getting a cursor (Impala session)
.cursor(): getting new session_handle
OpenSession: req=TOpenSessionReq(username='root', password=None, client_protocol=5, configuration=None)
Attempting to open transport (tries_left=3)
Transport opened
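One thing worth checking when cursor() hangs in OpenSession is whether the client's auth_mechanism matches what the server expects. As an assumption not confirmed in this thread, a server configured for SASL can leave a NOSASL client waiting, and vice versa. The sketch below builds the keyword arguments for impala.dbapi.connect with an explicit auth_mechanism and a timeout, so a mismatch has a chance to surface as an error instead of an indefinite hang. build_connect_kwargs and open_cursor are hypothetical helpers; auth_mechanism and timeout are passed through to connect, so verify them against the impyla version you have installed.

```python
# Mechanism names as used by impyla's connect(); treat the list as an assumption
# and check it against the installed version.
VALID_AUTH = ("NOSASL", "PLAIN", "GSSAPI", "LDAP")


def build_connect_kwargs(host, port, auth_mechanism="PLAIN", timeout=30,
                         user=None, password=None):
    """Build keyword arguments for impala.dbapi.connect with explicit settings."""
    if auth_mechanism not in VALID_AUTH:
        raise ValueError(f"unknown auth_mechanism: {auth_mechanism!r}")
    kwargs = {
        "host": host,
        "port": port,
        "auth_mechanism": auth_mechanism,
        "timeout": timeout,  # seconds; avoids waiting forever on the socket
    }
    if user is not None:
        kwargs["user"] = user
    if password is not None:
        kwargs["password"] = password
    return kwargs


def open_cursor(**kwargs):
    """Open a cursor; impyla is imported lazily so the helper above stays testable."""
    from impala.dbapi import connect
    return connect(**kwargs).cursor()


print(build_connect_kwargs("host_running_hs2_service", 10000, user="awoolford"))
```

Trying the same call with auth_mechanism="NOSASL" and comparing the behaviour is a quick way to see whether the hang is related to authentication at all.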
Python client for HiveServer2 implementations (e.g., Impala, Hive) for distributed query engines.
HiveServer2 compliant; works with Impala and Hive, including nested data
Converter to pandas DataFrame, allowing easy integration into the Python data stack (including scikit-learn and matplotlib); but see the Ibis project for a richer experience
Ubuntu:
apt-get install libkrb5-dev krb5-user
RHEL/CentOS:
yum install krb5-libs krb5-devel krb5-server krb5-workstation
Install the latest release with pip:
pip install impyla
or clone the repo:
git clone https://github.com/cloudera/impyla.git
cd impyla
python setup.py install
impyla uses the pytest toolchain, and depends on the following environment variables:
export IMPYLA_TEST_HOST=your.impalad.com
export IMPYLA_TEST_PORT=21050
export IMPYLA_TEST_AUTH_MECH=NOSASL
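As an illustration of how a test helper might consume these variables, here is a minimal sketch. load_test_settings is a hypothetical function and the fallback defaults are assumptions; impyla's own test suite may read the variables differently.

```python
import os


def load_test_settings(environ=os.environ):
    """Read the IMPYLA_TEST_* variables, falling back to assumed defaults."""
    return {
        "host": environ.get("IMPYLA_TEST_HOST", "localhost"),
        "port": int(environ.get("IMPYLA_TEST_PORT", "21050")),
        "auth_mech": environ.get("IMPYLA_TEST_AUTH_MECH", "NOSASL"),
    }


print(load_test_settings({
    "IMPYLA_TEST_HOST": "your.impalad.com",
    "IMPYLA_TEST_PORT": "21050",
    "IMPYLA_TEST_AUTH_MECH": "NOSASL",
}))
```

Note that the port arrives as a string from the environment and is converted to an integer before use.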