From: RJ Nowling
Subject: Re: [Gluster-devel] Follow: GSoC Proposal for a RESTful/JSON API and server for GlusterFS similar to WebHDFS
Date: Tue, 18 Mar 2014 22:31:59 -0400

Hi Jay,

Thank you for the interest and response.

A few use cases:
1) Disco (a map-reduce framework written in Erlang and Python) is using WebHDFS to add HDFS support.
2) WebHDFS is used to provide HDFS access for Python and Ruby.
3) Hadoop has a FileSystem plugin for WebHDFS -- used when you need to go through a firewall or in other situations where the regular HDFS network protocol isn't feasible.
4) Spring (the web framework) supports accessing data in HDFS using WebHDFS.
5) Fluentd (a data collection framework) supports HDFS using WebHDFS as a plugin.

For your second question:
Just to clarify a few points.  WebHDFS is the server and API.  The main part of my proposal, as it stands now, is to provide similar functionality for Gluster.  Hadoop does provide a client for WebHDFS that allows WebHDFS to be used as an alternative protocol for HDFS. 

According to the docs, the WebHDFS API provides complete support for the FileSystem API.  It's possible that the WebHDFS API could be seen as a generic HCFS API and that my proposed Gluster RESTful interface API could implement a compatibility mode for the WebHDFS API so that any client that can use WebHDFS can use the Gluster RESTful API.  Examples in this case would include the Hadoop WebHDFS client, Spring, and Fluentd WebHDFS plugin.

If a compatible API is implemented, the Hadoop WebHDFS client with the Gluster RESTful server could be used in place of the GlusterFS-hadoop plugin / FUSE client combination.
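
To make the compatibility idea concrete, here is a minimal sketch of how a WebHDFS-style request is addressed. The URL layout (/webhdfs/v1/<path>?op=...) and the operation-to-HTTP-method pairs follow the WebHDFS REST API. The host name, the port, and the webhdfs_request helper are made up for illustration; none of this is an existing Gluster interface.

```python
from urllib.parse import quote, urlencode

# HTTP method used by each WebHDFS operation (a subset of the API).
HTTP_METHOD = {
    "OPEN": "GET",
    "GETFILESTATUS": "GET",
    "LISTSTATUS": "GET",
    "CREATE": "PUT",
    "MKDIRS": "PUT",
    "RENAME": "PUT",
    "APPEND": "POST",
    "DELETE": "DELETE",
}


def webhdfs_request(host, port, path, op, **params):
    """Return (http_method, url) for a WebHDFS-style operation.

    Hypothetical helper: a compatibility mode in the proposed server would
    have to accept URLs of this shape.
    """
    op = op.upper()
    if op not in HTTP_METHOD:
        raise ValueError("unsupported operation: " + op)
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    # "op" goes first; the remaining parameters are sorted for stable output.
    query = urlencode([("op", op)] + sorted(params.items()))
    url = "http://{}:{}/webhdfs/v1{}?{}".format(host, port, quote(path), query)
    return HTTP_METHOD[op], url


if __name__ == "__main__":
    # Assumed host and port, for illustration only.
    print(webhdfs_request("gluster.example.org", 8080, "/data/log.txt", "open", offset=0))
```

A client such as the Hadoop WebHDFS FileSystem plugin only sees these URLs and the JSON that comes back, which is why mirroring them would be enough for it to talk to a Gluster-backed server.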

We would need to discuss whether the FileSystem API (as mirrored by the WebHDFS API) is, by itself, sufficient for all users of Gluster or not.  If it is, then we can just implement that and focus the proposal on providing compatibility with WebHDFS clients.  If not, we can develop an API that mirrors Gluster's semantics and provide a compatibility mode for the WebHDFS API.

RJ


On Tue, Mar 18, 2014 at 10:06 PM, Jay Vyas <address@hidden> wrote:
I definitely like the idea. Thanks for putting this together, RJ.

- What are the main use cases for WebHDFS, and how do people currently use it in the real world?

- What portions of the FileSystem and FileContext contracts does WebHDFS cover, and can we morph its client to make it HCFS compatible and leverage our existing GlusterFS-hadoop plugin?

I can help mentor it from the perspective of the Java integration and API usability, and I'm sure we can track down some folks on the C/Gluster side of things who are able to help with the lower-level details.

On Mar 18, 2014, at 9:20 PM, RJ Nowling <address@hidden> wrote:

Hi all,

I wanted to follow up.  I drafted a proposal for creating a RESTful/JSON API and server for GlusterFS similar to WebHDFS.  As the number of big data processing and storage systems explodes, integration is becoming more important.  A language- and operating-system-agnostic RESTful/JSON API and server could be helpful for easing integration efforts.

I've pasted the proposal below.  Is there any interest in the Gluster community?  Would anyone be willing to serve as a mentor?

Thank you,
RJ

RESTful/JSON API and Server for GlusterFS

Overview of proposal:
The goal of the proposal is to create a RESTful/JSON API and server (similar to WebHDFS) for GlusterFS. 

Need it fulfills:
Following on the popularity of Hadoop, a number of "big data" processing systems (e.g., Berkeley Data Analytics Stack, Storm, Stratosphere, Disco) are being created and adopted.  These systems are written in a wide range of languages such as Java, Scala, Python, and Erlang.

These systems are rarely used in isolation. Maintaining separate distributed file systems and databases is laborious, costly, and wasteful. Migrating data between separate distributed file systems or databases is difficult and error prone, and it limits easy access to data when it is needed. As a result, there is great interest in integration, as exemplified by projects such as the Gluster plugin for Hadoop.

Gluster's existing clients (FUSE, libgfapi) are limited to specific operating systems (Linux) and/or require bindings for each programming language of interest.  RESTful/JSON APIs and servers such as WebHDFS offer a more general solution that is independent of the client's operating system and programming language.  WebHDFS has proven popular and is being used by systems such as Disco to add HDFS support.  A RESTful/JSON interface and server could offer similar benefits for Gluster and has the potential to be just as popular as WebHDFS.

Any relevant experience you have:
I am familiar with WebHDFS and the Gluster plugin for Hadoop. Through my Ph.D. research and TA'ing experience, I am familiar with distributed systems (e.g., WorkQueue), client-server systems, and RESTful/JSON APIs.  I have some experience with CherryPy, a Python web service framework, and with using it to create RESTful/JSON servers. I am also familiar with the work in Disco to add HDFS support through WebHDFS.

How you intend to implement your proposal:
Aim 1: Design a RESTful/JSON interface that supports the semantics of Gluster.
The ability to report data locality information will be important for other projects that use that information for scheduling workers and tasks.

Aim 2: Create a RESTful/JSON server.
I will use Python and its libraries such as CherryPy or Flask to develop a RESTful server. My preferred option will be to use Python bindings to libgfapi as a backend, but I will fall back to using the Gluster FUSE client if I run into problems.  A dummy backend that uses the local file system will be created for testing purposes. (It would be good to support multiple backends.)  
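
As a rough sketch of the dummy backend idea, the class below serves a directory of the local file system behind the kind of interface a libgfapi or FUSE backend could also implement. The class name, method names, and path-escaping check are assumptions made for illustration. The JSON field names (FileStatuses, FileStatus, pathSuffix, type, length, modificationTime) are borrowed from WebHDFS responses. Wiring this into CherryPy or Flask is left out.

```python
import os
import stat


class LocalBackend:
    """Dummy backend: maps API paths onto a directory of the local file system."""

    def __init__(self, root):
        self.root = os.path.realpath(root)

    def _resolve(self, path):
        full = os.path.realpath(os.path.join(self.root, path.lstrip("/")))
        # Refuse paths that escape the backend root (for example via "..").
        if full != self.root and not full.startswith(self.root + os.sep):
            raise ValueError("path escapes backend root: " + path)
        return full

    def file_status(self, path):
        st = os.stat(self._resolve(path))
        is_dir = stat.S_ISDIR(st.st_mode)
        return {
            "pathSuffix": "",
            "type": "DIRECTORY" if is_dir else "FILE",
            "length": 0 if is_dir else st.st_size,
            "modificationTime": int(st.st_mtime * 1000),  # milliseconds
        }

    def list_status(self, path):
        entries = []
        for name in sorted(os.listdir(self._resolve(path))):
            status = self.file_status(path.rstrip("/") + "/" + name)
            status["pathSuffix"] = name
            entries.append(status)
        return {"FileStatuses": {"FileStatus": entries}}

    def read(self, path, offset=0, length=None):
        with open(self._resolve(path), "rb") as f:
            f.seek(offset)
            return f.read() if length is None else f.read(length)
```

Keeping the web layer unaware of which backend it talks to is what would make the fallback from libgfapi to FUSE, and the local-file-system test setup, interchangeable.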

Aim 3: Create a RESTful/JSON Python library.
I will create a Python library that uses the RESTful/JSON interface as a backend.

Aim 4: Create Unit Tests and Benchmarks for Several Use Cases
As part of my effort, I will write unit tests to ensure that the server and client library are implemented correctly.  Because good performance will be important for adoption, I will also document several use cases and perform benchmarks to evaluate the performance of the RESTful/JSON server compared with the standard FUSE client.
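
A benchmark along these lines could start from a small timing harness such as the sketch below. The function names are hypothetical, and the mount path and URL passed in would depend on the actual deployment. It simply reads the same file through two access paths and reports the best observed throughput.

```python
import time
import urllib.request


def read_via_mount(path):
    """Read a whole file through a mounted file system (e.g. a FUSE mount)."""
    with open(path, "rb") as f:
        return f.read()


def read_via_rest(url):
    """Read a whole file through an HTTP endpoint (the proposed REST server)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def benchmark(read_fn, target, repeats=5):
    """Call read_fn(target) several times and report the fastest run."""
    best = float("inf")
    size = 0
    for _ in range(repeats):
        start = time.perf_counter()
        data = read_fn(target)
        elapsed = time.perf_counter() - start
        size = len(data)
        best = min(best, elapsed)
    mb_per_s = size / best / 1e6 if best > 0 else float("inf")
    return {"bytes": size, "best_seconds": best, "mb_per_s": mb_per_s}
```

Repeated reads of the same file are likely to be served from caches, so a fair comparison would also need cold-cache runs and a range of file sizes.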

Aim 5: (Optional and time permitting) Work on integration with a big data system as a proof of concept
Option 1: Integrate with Hadoop by mimicking the WebHDFS API so that the Hadoop WebHDFS client can transparently use the Gluster RESTful API as a backend.

Option 2: Integrate with Disco, an Erlang/Python MapReduce framework.  Support for HDFS is currently being added using the WebHDFS interface.  The WebHDFS work provides a good template for adding Gluster support.

--
em address@hidden
c 954.496.2314