The HDFS plugin lets Deepgreen DB access files residing on the Hadoop Distributed File System (HDFS). It currently supports two file formats: CSV and SPQ.
Suppose we store data in the directory /dw/data on HDFS. The following mountpoint would allow access to CSV files:
[[xdrive.mount]]
name = "datacsv" # mountpoint name
argv = ["xdr_hdfs/xdr_hdfs", # plugin
"csv", # file format (csv or spq)
"hdfsnamenode", # HDFS name node host or service name
"8020", # HDFS name node port
"/dw/data", # root data dir
]
env = ["LIBHDFS3_CONF=path-to-hdfs-client.xml"]
If hdfsnamenode is a service name, its value will start with the scheme hdfs:// and the port value will be '0'. For example:
[[xdrive.mount]]
name = "datacsv" # mountpoint name
argv = ["xdr_hdfs/xdr_hdfs", # plugin
"csv", # file format (csv or spq)
"hdfs://mycluster", # HDFS name node host or service name
"0", # HDFS name node port
"/dw/data", # root data dir
]
env = ["LIBHDFS3_CONF=path-to-hdfs-client.xml"]
External table DDL that refers to the HDFS plugin must have a LOCATION clause that looks like this for reads:
LOCATION('xdrive://XDRIVE-HOST-PORT/MOUNTPOINT/path-to-files-with-wildcard.{csv|spq}')
FORMAT '{CSV|SPQ}'
And for writes:
LOCATION('xdrive://XDRIVE-HOST-PORT/MOUNTPOINT/path-to-files-with-#UUID#.{csv|spq}')
FORMAT '{CSV|SPQ}'
DROP EXTERNAL TABLE IF EXISTS nation;
CREATE EXTERNAL TABLE nation (
n_nationkey integer,
n_name varchar,
n_regionkey integer,
n_comment varchar)
LOCATION ('xdrive://XDRIVE-HOST-PORT/datacsv/public/nation/nation*.csv')
FORMAT 'CSV';
DROP EXTERNAL TABLE IF EXISTS nation_writer;
CREATE WRITABLE EXTERNAL TABLE nation_writer (
n_nationkey integer,
n_name varchar,
n_regionkey integer,
n_comment varchar)
LOCATION ('xdrive://XDRIVE-HOST-PORT/datacsv/public/nation/nation#UUID#.csv')
FORMAT 'CSV';
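Once the external tables are defined, reads and writes go through ordinary SQL. A minimal usage sketch, assuming the tables above and matching CSV files under /dw/data/public/nation (the #UUID# token in the writable table's LOCATION is presumably substituted with a unique id so each writer emits a distinct file):

-- Read CSV files on HDFS through the readable external table
SELECT count(*) FROM nation;

-- Write rows out as CSV files on HDFS through the writable external table
INSERT INTO nation_writer SELECT * FROM nation;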
You may pass the HDFS client configuration file via the environment variable LIBHDFS3_CONF, which should explicitly point to the hdfs-client.xml file you want to use. hdfs-client.xml has the same format as core-site.xml and hdfs-site.xml. You may concatenate core-site.xml and hdfs-site.xml into a single hdfs-client.xml, combining the <property> elements from both files under one <configuration> root element.
For a non-secure connection, you'll need to set the hadoop.security.authentication property to simple and com.vitessedata.xdrive.security.user.name to the Hadoop username:
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>simple</value>
  </property>
  <property>
    <name>com.vitessedata.xdrive.security.user.name</name>
    <value>hduser</value>
  </property>
</configuration>
For a secure connection, you'll need to change the hadoop.security.authentication property from simple to kerberos. If a Kerberos ticket cache is used for authentication, you can set the hadoop.security.kerberos.ticket.cache.path property in hdfs-client.xml to the ticket cache path:
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
  <property>
    <name>hadoop.security.kerberos.ticket.cache.path</name>
    <value>kerberos_ticket_cache_filepath</value>
  </property>
</configuration>
Alternatively, you can set the environment variable KRB5CCNAME to specify the ticket cache file path:
[[xdrive.mount]]
name = "datacsv" # mountpoint name
argv = ["xdr_hdfs/xdr_hdfs", # plugin
"csv", # file format (csv or spq)
"hdfsnamenode", # HDFS name node host or service name
"8020", # HDFS name node port
"/dw/data", # root data dir
]
env = ["LIBHDFS3_CONF=path-to-hdfs-client.xml", "KRB5CCNAME=path-to-kerberos-ticket-cache"]
Although HDFS is resilient to the failure of data nodes, the name node is the single repository of metadata for the system, and therefore a single point of failure. High availability (HA) involves configuring standby name nodes that can take over in the event of a failure. With HA, the mountpoint references the nameservice (e.g. hdfs://mycluster) with port '0', as shown earlier, and hdfs-client.xml describes the name nodes behind the nameservice:
<configuration>
  <property>
    <name>dfs.default.uri</name>
    <value>hdfs://mycluster</value>
  </property>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>namenode1,namenode2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.namenode1</name>
    <value>hostname1:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.namenode2</name>
    <value>hostname2:8020</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.namenode1</name>
    <value>hostname1:50070</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.mycluster.namenode2</name>
    <value>hostname2:50070</value>
  </property>
</configuration>
To test the HDFS client configuration, you may run $GPHOME/plugin/xdr_hdfs/xdr_hdfscmd to verify connectivity to the Hadoop server.
To read the contents of a file on HDFS:
LIBHDFS3_CONF=path-to-hdfs-client.xml xdr_hdfscmd cat 127.0.0.1 8020 /data/nation.csv
To list the files in a directory on HDFS:
LIBHDFS3_CONF=path-to-hdfs-client.xml xdr_hdfscmd ls 127.0.0.1 8020 /data/
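For an HA cluster, the mountpoint convention presumably carries over to xdr_hdfscmd as well: pass the service name as the host and '0' as the port (an assumption based on the mount syntax above):
LIBHDFS3_CONF=path-to-hdfs-client.xml xdr_hdfscmd ls hdfs://mycluster 0 /dw/data/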