Why DataShell?

Why DataShell?

In the sprawling landscape of data tools, why do we need another one?

For several years, Matt Turck has been compiling the Machine Learning, Artificial Intelligence, and Data (MAD) landscape in an effort to make sense of this vibrant space. The image above represents the most recent version of the landscape. The diversity and intricacy of the ecosystem can be overwhelming, if not downright intimidating. Then, why create another product? Why another company in the space? Why DataShell?

I was a software developer and architect for 12+ years, before transitioning to data analytics consulting some 10 years ago. When I started working on data projects, I was very surprised for the lack of automation, scripting, and configuration- or code-driven tools available to use. Even when I was using 'modern' data products, it was hard or impossible to apply the software development best practices I've learned in the past, to my data projects.

And here's my thinking. Even though data projects are conceptually different than software projects, I would argue that data platforms are basically nothing more than software solutions. As such, they should benefit from all the true and tested techniques that are commonplace in software dev, such as GitOps, CI/CD, unit and integration testing, etc.

Think for a moment on the way a software developer approaches the development of a new feature. The process will be something like this (please note this has been grossly simplified!):

  1. Create a new feature branch from the base development branch.

  2. Point local dev to a version of the app database that matches the dev branch or create a version of the database from scratch in local dev, based on schema management scripting/ORM capabilities (this step typically involves initializing the database with "seed" data).

  3. Develop the new feature. Code and database schema changes are captured in configuration or code, and checked in regularly in source control.

  4. Once the development is complete, a pull request (PR) is created. This PR is automatically tested, peer-reviewed, and demoed (if needed), then integrated into the base development branch. Eventually, the feature finds its way to production, once the next release is put together.

Now, wouldn't be nice if Data Engineers could work the same way?

Granted, there are a number of tools (dbt, Dataform, Cube, Looker, etc) that have embraced the code- or configuration-first approach, and they have made a huge difference, but I think we can all agree we are not there yet.

This is what DataShell is all about. Our goal is to enable Data Engineers to apply software development best practices to data projects, by integrating best-of-breed Open Source software in an open, cohesive, server-less platform.

Stay tuned for more!