Show HN: Query your database using plain English, fully on-premises (vizly.fyi)
104 points by alishobeiri on Aug 30, 2023 | 54 comments
Hi folks,

My friend Sami and I recently built Vizly, a Mac application that allows anyone to query their databases using plain English.

Vizly is built on Llama 2, llama.cpp, and runs fully on-prem (edit: meaning everything is local and your data never leaves your own computer).

We are running two Llama models, one for natural language to SQL translation, and another that uses the results from the SQL to render visualizations. That means there are no external APIs and all the AI models are running locally on your MacBook.

We tried to make Vizly very easy to share as well. Every Vizly instance creates a share link that can be accessed by anyone on the same network as you; just send it to them and they will immediately be able to run AI-powered queries, hosted from your device.

Vizly used to be a hosted solution for querying CSVs; now we are fully on-prem and focused specifically on databases.

Would love if you could try it out and give us any feedback!




This is a cool experiment, but at least in my company I’m not sure how useful it would be.

There are some non-developer users who can run queries on a read-only copy of the database. However, for anything complicated they usually have to ask the developers whether the query they have written actually matches the English description of what they want. Sometimes it does, but often there are some nuances that their query doesn't capture.

Most of the expense of getting this data, therefore, isn't writing the query; it's validating it.

If you need the help of a tool like this to write your query, how are you possibly going to know whether the results are what you want without taking the generated query to an expert?


I was wondering the same thing about these kinds of products.

Garbage in - garbage out.

Same concern I have with all the companies using LLM for searching company documents.

Most of the data I have seen at the corporate level does indeed have nuance to it and is often not as clear as just the named column. There's usually little to no documentation, and that is always a tough problem to solve. Imo it's about how to maintain knowledge bases properly; it's a tough nut to crack.


a fool with a tool is cheaper than an expert. especially if expertise is needed to adequately estimate cost.

that’s the gamble all of these “natural language” tools are making. sell to people that aren’t experts, attempt to deflect criticism from those that are. what’s a $20k subscription compared to a $200k person?

what comes next depends on the company. some will invest the money in the product to make it more useful. others will search for more fools.

most importantly, nobody wants to be left behind. there’s a lot of products that need more time in the oven, but that’s never stopped a good salesperson.


Sadly, they won't care - they'll trust the tools, like they've blindly done for decades. "Computer says no", and all that.

We're in for a scary wild ride.


Looks like a great start. Congrats on launching. I think this is a potential killer use case for LLM. The privacy aspect is great. One challenge I've personally experienced trying to scale my own on-prem AI product is that deployments that need prod data access almost always require some handholding/bespoke work, as well as insanely arduous legal contracting, especially as you get into enterprise customers, who value the privacy element most. That makes scaling hard. One GTM channel you might want to consider is something like the AWS Server Offer or other similar cloud/data vendor partnerships that might help solve some of these challenges. These vendors are also falling head over heels to get Gen AI startups onto their platforms atm.


> killer use case for LLM

I hope you're referring to 'someone can eventually get killed by this'.

For the love of humanity, please don't use this in any real world case. Llama will generate mistaken queries in at least single-digit % of cases (optimistically).

If you use this, it's only a matter of time before a mistaken query coincides with very bad timing and context, leading to bad outcomes.


You’ve conveniently dropped the word “potential” from the start of your quote.

> any real world use case

I can see plenty of real world use cases where a trade-off between ease of use and accuracy might be acceptable; there are a plethora of technical solutions for improving/checking accuracy, and workflows can be designed to prevent “bad outcomes”. Certainly enough potential here for a set of plucky founders to give this a crack.


> plenty of real world use cases

Any examples come to mind?

I can't see a case where data retrieved from a database is wrong, someone does something with it, and it's fine.


I can see the most potential for this type of product where it has access to multiple company systems. You could ask a whole bunch of stuff: have we ever done a customer survey on x, where can I find the results, who wrote the report, summarise the findings, do they still work here, what's their email.

I get that this product looks like it's single-DB, not yet connectable to knowledge systems, file stores, etc., but it doesn't take much imagination to see it going there. Having an interactive company brain would be useful even if you don't want to trust it to answer more specific questions like "what percentage of our customers are x".


> do they still work here, what’s their email

And what happens when you send the wrong message with sensitive data to the wrong Brian because of that?


What sort of safeguards do you have to prevent the LLM from emitting sql queries that edit the data?


The way we enforce it is that we only generate SELECT queries, and we do a post-processing verification to make sure that the result is actually a read-only query.

I think you have a great point though: this should be made much clearer so users can build trust in the product. Thanks a lot for the feedback!


> The way we enforce it is that we only generate SELECT queries, and we do a post-processing verification to make sure that the result is actually a read-only query.

So it will select from functions?

  CREATE OR REPLACE FUNCTION delete_all_from_table(_tbl regclass)
  RETURNS integer AS
  $func$
  DECLARE
    _r record;
    _count integer = 0;
  BEGIN
    FOR _r IN 
        EXECUTE format('DELETE FROM %s RETURNING *', _tbl)
    LOOP
        _count := _count + 1;
    END LOOP;

    RETURN _count;
  END
  $func$  LANGUAGE plpgsql;


  SELECT delete_all_from_table('users');
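
A read-only transaction would block exactly this case, since Postgres rejects writes inside one even when they are buried in a function call. A sketch of that backstop (not necessarily what Vizly does):

  BEGIN READ ONLY;
  SELECT delete_all_from_table('users');
  -- ERROR: cannot execute DELETE in a read-only transaction
  ROLLBACK;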


What safeguards prevent the LLM from generating Cartesian products?
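
To make the hazard concrete: a generated query that drops its join predicate becomes a cross join and can return rows-times-rows results. One blunt server-side guard, independent of the LLM, is a statement timeout (a Postgres sketch; the table names are made up):

  -- Missing join condition: returns |orders| * |customers| rows.
  SELECT * FROM orders, customers;

  -- Kill anything on this session that runs longer than 5 seconds:
  SET statement_timeout = '5s';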


Isn't that as easy as giving a non-write user to the app?
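
In Postgres that's just a few GRANTs (a sketch; the role, password, and database names are made up):

  CREATE ROLE vizly_reader LOGIN PASSWORD 'secret';
  GRANT CONNECT ON DATABASE app_db TO vizly_reader;
  GRANT USAGE ON SCHEMA public TO vizly_reader;
  GRANT SELECT ON ALL TABLES IN SCHEMA public TO vizly_reader;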


This is another way to make sure that the app doesn't overstep its permissions.

It's still very valid feedback. On our side, we should make the permissions of each user very clear, and make it clearer that we only run SELECT queries. Will make sure to make those changes! Thanks for the feedback!


This reply gives me a bit of anxiety. Solving this should have been one of the most important things for a product like this.


Surely you wouldn't give a third party analytics app write access to your prod DB? It would be a nice UX improvement if the app checked the permissions and gave you instructions on how to set the permissions properly, but this seems to be entirely at the level of setting it up correctly.
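
Postgres can answer that question directly, so the check could be a single query at connection time (a sketch; the table name is made up):

  -- True if the connected role holds any write privilege on the table:
  SELECT has_table_privilege(current_user, 'users',
                             'INSERT, UPDATE, DELETE, TRUNCATE');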


Obviously, generating queries from natural language should be the most important thing

And providing the user interface to do that


I think I would be okay with not having the ability to generate queries from natural language if having it came with the possibility of my data randomly getting clobbered!


This is interesting! Congrats on launching! I would personally love to see a simple CSV dump as input :) Would you be able to generate possible questions automatically?


Yes, it is possible to generate questions automatically. Just so I understand the CSV use case better: is it because you don't trust the application enough to connect a database to it?


In tons of enterprise and just general office settings, you'll be given a CSV file, rather than a database dump. Heck, even scientific lab equipment will output CSV for you. It's basically the lowest common denominator for all kinds of tabular data.

So anything that makes using CSV files easier with your product is likely going to be a welcome change for a lot of people.

Realistically, your database engine can probably ingest the CSV file in a simple import statement, so you're likely 98% done already with this. :)
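
In Postgres, for example, it's roughly the following (a sketch; the table, columns, and path are made up, and server-side COPY needs file-read permissions, so psql's \copy is the client-side alternative):

  -- Stage the CSV as a table, then query it like any other table:
  CREATE TABLE uploaded_csv (col_a text, col_b text, col_c text);
  COPY uploaded_csv FROM '/tmp/data.csv' WITH (FORMAT csv, HEADER true);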


I do work for a mid-sized company, and they are drowning in CSV files. They don't have a database, really. They have lots of different enterprise software systems that they use, each of which exports differently formatted CSV files that need to be combined in various ways to produce either different CSV outputs (for importing into different systems or to be sent to some other organisation) or some other output (like a list of PDFs that are sent to some other organisation or individual) or reports for business intelligence purposes.

I use BigQuery as the database, which is a competitor to Snowflake, which you already support. I also already use ChatGPT-4 a lot nowadays to create and edit the SQL queries (although they can sometimes be so large that ChatGPT can't cope with them, so I sometimes use cut-down/contrived examples to write chunks of a query and then construct a much bigger query from those smaller examples).

So you could replace my job if you could allow people to name their data warehouse provider of choice, drag and drop umpteen CSV files into it, write some instructions like "Give me a CSV file with all the BLAH matching BLAH" or "Create PDF documents with BLAH information in them" and get the outputs they need on the other end.


Not the original commenter, but I have often been asked by non-technical colleagues to answer questions about a spreadsheet they've sourced from someone/somewhere else. "So-and-so shared this spreadsheet with me. Can you please take a look and let me know if ...". Sometimes this is delegation, but frequently it comes down to a lack of skill.


This is a very interesting product! Congrats on the launch!

I was wondering if this would also be supported for NoSQL databases like Mongo?

Also a little sandbox to play with it would be nice. For example, you could have a small table of weather data from a random city, and people could then query something like "How many days in 2022 was the daily high for Los Angeles above 100 degrees?" Of course, this might be a lot of work as you would have to have it run on the backend and then return the results to the frontend.


Yes definitely, that is something I think would be interesting for us to add. Since we are building everything fully on-prem, we didn't have a web-based product to confuse people, but I think your point stands that having a tangible demo to play with will help users trust the product enough to eventually run it on-prem.

Thanks a lot for the feedback!


How much can I trust the insights a tool like this gives?

The landing page focuses on ease of use and privacy (awesome!). But is this a tool I can use to make critical informed decisions or something where I am willing to trade correctness for ease of use?

Being able to click into a chart and see the generated query along with explanation of the intent of each step of the query would go a long way in building trust in the result.


Hey, thanks, I was pretty intrigued by this. A couple questions:

1. Your website doesn't say which DBs you support.

2. What information do you feed the model to determine how to map from English language queries to SQL? Do you just use the schema from information_schema? Do you use any DB object comments (e.g. we annotate all our tables, views and columns with Postgres comments)? Do you sample any actual data?

Thanks


Hey! Thanks a lot for the feedback. That is something we actually forgot to add; I will add it now to avoid confusion!

The database connectors we support right now are:

- MySQL

- Postgres

- Snowflake

- Apache Impala

In terms of how we map English queries to SQL, we only look at the schema at the moment.
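
Roughly, "looking at the schema" means serializing something like the output of this kind of query into the prompt (a simplified sketch, not necessarily the exact query we run):

  SELECT table_name, column_name, data_type
  FROM information_schema.columns
  WHERE table_schema = 'public'
  ORDER BY table_name, ordinal_position;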

We have the ability to add enriching information, such as sampling the data or adding comments to the schemas, and we have experimented with both.

We've noticed that both approaches do increase accuracy, but just to keep things simple for the initial release we haven't added them yet.

Just so I know for future reference, what system do you use for DB object comments?

Thanks again for your feedback!


> Just so I know for future reference, what system do you use for DB object comments?

Postgres supports this natively (there was actually an earlier HN post today about "little known Postgres features" that mentioned it): https://www.postgresql.org/docs/current/sql-comment.html

We use it extensively because tools like Datagrip will display these comments in the schema browser. It would be pretty trivial I imagine to update your tool to pull these comments from the Postgres schema.
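
For reference, the syntax is just the following (table, column, and comment text are made up for illustration), and the comments can be read back from the catalog:

  COMMENT ON TABLE orders IS 'One row per customer order';
  COMMENT ON COLUMN orders.status IS 'pending | shipped | cancelled';

  -- Read a table's comment back from the catalog:
  SELECT obj_description('orders'::regclass, 'pg_class');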


> (there was actually an earlier HN post today about "little known Postgres features" that mentioned it)

"Lesser Known Postgres Features (2021)":

https://news.ycombinator.com/item?id=37309309


Looks cool, but I was kind of wondering what the expected use case for this is. Since it doesn't exactly connect different tables (it just queries the same table and draws insights from it), I am not sure how I see myself using this.


Nice work! The project is looking great. Any plans to open-source it?


This is cool, but I think usage with Excel and CSV sheets is more common. Why are you guys switching to databases?


How long do you guys think until tools like this eliminate data analyst or business intelligence developer jobs?


The natural-language-to-SQL space is interesting; tools keep popping up, and the on-premises property is key. The space is fragmented and thus difficult to write integrations for, but the text-input fields of SQL clients / notebooks / BI tools (think Metabase) all look set to become NL-to-SQL equipped. At the least, it will be a race among vendors to provide this.


Will it be available for other platforms in the future? What will the pricing model be?


What library are you using for visualizations? Does the AI generate the visualization?


Is it free? One-time purchase, or what's the pricing model?


Sorry for the tangential remark, but you may be interested in fixing the English mistake in the headline - while “on premise” is an extremely common mistake, it is still a grammatical error. In the sense of “building or property”, the word is always plural, ‘premises’. A ‘premise’ is only ever a logical proposition or statement from which another follows.

It is actually nice to see the abbreviation “on prem” being used, because at least the error is abbreviated away!

But it does sound like an interesting project!


(As you pointed out), 'on-premise' is used very often (according to Google Trends, as much as or more than 'on-premises'). There's an argument to be made that language is about how people use it, not the theoretical rules of grammar; therefore, if a 'mistake' is made often enough it should no longer be considered a mistake.


I read the headline as logical premise, or assumption.


Fair enough, language should strive to be unambiguous. I guess I've spent enough time in the corporate world that 'on premise' just means 'on-site' to me. Perhaps a hyphen would've helped.


Ok, we've updated the title (submitted title was "Show HN: Query your database using plain English on premise") and I'll fix the text too.


Just use "on-prem"


A "premise" is part of a logical argument. You want to say "premises", which refers to a location. I'm sorry to be that person.



This is not correct. 'On premise' (normally hyphenated) refers to servers hosted by the same organisation as the application owner, and is opposed to cloud.

It is correct to say that an application runs 'on premise' if it can be self-hosted (often 'on premise' or 'on prem' does not literally mean that the server is on the same premises as the company's headquarters, just that the company manages the server).


Having looked this up in a couple of dictionaries now, I think the person you're responding to is absolutely correct in their objection.

The word "premise" is misused in "on premise" speaking of computers running at a owner's location. I don't think the distinction of it referring to a literal location or metaphorical location (managed by an owner at a different location) matters.

This might change over time with ongoing vernacular usage, but as of now our industry is misusing the word. Including myself (until now).


While the person I'm responding to may be upset that 'premise' seems to be acquiring a second meaning in this specific circumstance, they are almost certainly not correct that OP 'want[s]' to say premises. OP almost certainly said exactly what they intended to say.

The use of 'on-premise' rather than 'on-premises' even in formal texts is at least 200 years old (examples below). How much time should it take for us to accept "on-premise", rather than "on-premises"?

(https://www.google.hu/books/edition/Records_and_Briefs_of_th..., https://www.google.hu/books/edition/Documents/3jIbAQAAIAAJ?h..., https://www.google.hu/books/edition/Council_Proceedings/rLro...).


Not everyone speaks English as a first language. It’s a Show HN too. No need to say sorry!


TIL that "premises" is plural-only if you want to refer to a location. I might have made that mistake as well (also not a native speaker) - and it's one of those that spellcheckers don't catch...


A Mac application? Big enterprises are mostly on Windows desktops. Especially those that would be interested in this app.

Otherwise the idea is great. Now please, someone create this for Splunk...



