aymanhs72 5 hours ago [-]
Sorry for being late. Rollups were only pushed recently; I thought about them for a long time before it finally clicked. They are the main advantage over plain log files.
Another thing is that it is designed to be extremely efficient with SD cards. It uses a WAL (optional) to keep things safe, and it writes highly compressible data to the main database. My tests averaged 2-3 bytes per point, and the more data you have, the better the compression. So you can store many days of 10-second data in less than a few MB.
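To see why 2-3 bytes per point is plausible for regular sensor data, here is a minimal sketch (not the project's actual format) of delta-encoding 10-second timestamps and slowly-changing values, then varint-encoding the zigzagged deltas:

```python
def zigzag(n):
    # Map signed ints to unsigned so small negative deltas stay small.
    return (n << 1) ^ (n >> 63)

def varint(n):
    # Standard base-128 varint: 7 bits per byte, high bit = "more follows".
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode(points):
    # points: [(epoch_seconds, value_in_centidegrees), ...]
    out = bytearray()
    prev_t, prev_v = 0, 0
    for t, v in points:
        out += varint(zigzag(t - prev_t))
        out += varint(zigzag(v - prev_v))
        prev_t, prev_v = t, v
    return bytes(out)

# A steady 10-second temperature feed wobbling by a few hundredths of a degree.
points = [(1_700_000_000 + 10 * i, 2150 + (i % 3)) for i in range(1000)]
blob = encode(points)
print(len(blob) / len(points))  # roughly 2 bytes per point
```

After the first (full-width) point, each entry costs one byte for the timestamp delta and one for the value delta, so regular intervals and stable readings compress very well before any general-purpose compression is even applied.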
amluto 17 hours ago [-]
Wow, Home Assistant should try something along these lines. Home Assistant’s current handling of time series data is comically poor.
Another decent option might be Clickhouse. Sadly, as far as I know, DuckDB has no real understanding of sorted or ordered data, so it might be challenging to avoid absurd amounts of read amplification.
ul5255 14 hours ago [-]
I have several years' worth of timestamped sensor data in a SQLite DB that is 12+ GB and growing. One idea was to experiment with DuckDB. Your comment about DuckDB has me worried, as time-range queries are common. Any link for me to dig deeper?
amluto 11 hours ago [-]
Not a specific link, but a DuckDB table is a bag of rows (logically — it’s physically arranged as a column store), and those rows are not ordered in a way that is expressed in the schema. If you do a big analytic query, DuckDB will (extremely efficiently) scan the whole table and will blow many other tools out of the water while doing so. But if you want to see the sensor value of a specific sensor at a specific time, you want an index of some sort, not a full table scan. And if you want to do a rollup of some but not all sensors, you end up modifying some stuff in the middle of a table, which is not amazingly efficient. DuckDB has an optional index, but I don’t think it’s meant for this.
You could certainly create a directory with a Parquet file for each (entity id, time range), and you could probably convince the DuckDB query engine to understand that (using Ducklake? raw Hive can only barely do this), but I don’t think that DuckDB will binary search for you. (And binary search is actually pretty lousy for this use case.)
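The read-amplification point can be made concrete with a toy comparison (this is just the access pattern, not DuckDB internals): a point lookup via a full scan touches every row, while a layout that knows the data is sorted can binary-search it:

```python
import bisect

# 100,000 rows of (timestamp, sensor, value), already sorted by timestamp.
rows = [(ts, f"sensor-{ts % 5}", 20.0 + ts % 7) for ts in range(0, 1_000_000, 10)]
timestamps = [r[0] for r in rows]

def full_scan(ts):
    # What a scan-oriented engine does when nothing tells it the data is sorted.
    touched = 0
    for row in rows:
        touched += 1
        if row[0] == ts:
            return row, touched
    return None, touched

def binary_search(ts):
    # O(log n) probes once the sort order is known to the query layer.
    i = bisect.bisect_left(timestamps, ts)
    if i < len(rows) and rows[i][0] == ts:
        return rows[i]
    return None

_, scanned = full_scan(999_990)
row = binary_search(999_990)
print(scanned, row[0])  # the scan touched all 100,000 rows; bisect found it directly
```

The gap only widens on disk, where "rows touched" becomes "pages read" off an SD card or spinning disk.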
DuckDB has indexes, including one specifically for highly selective point lookups; the structure is called an adaptive radix tree (ART). The workloads you're describing are sort of DuckDB's bread and butter (it'll even handle the Parquet stuff out of the box).
I've been thinking about this question for a while. It's confusing at first blush because it's an append-only database and it has a WAL — and it feels like a WAL is already an append-only database, so what's even happening?
Looking back at the project now, I think the value comes from querying it, and especially from automatic aggregations.
This is a reasonable use case that can't immediately be resolved by just logging to a file. It creates an aggregation profile, so a sensor could log temperature every minute and the database will automatically average temperature by the hour. That's a straightforward and meaningful use case.
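The aggregation-profile idea described above can be sketched in a few lines; the function name and sample layout here are illustrative, not the project's API:

```python
from collections import defaultdict

def hourly_rollup(samples):
    # samples: [(epoch_seconds, temperature), ...] logged every minute.
    buckets = defaultdict(list)
    for ts, temp in samples:
        buckets[ts - ts % 3600].append(temp)  # truncate to the start of the hour
    # One averaged row per hour replaces sixty raw rows.
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}

# One hour of per-minute readings drifting slowly upward.
samples = [(7200 + 60 * i, 20.0 + 0.1 * i) for i in range(60)]
print(hourly_rollup(samples))  # one bucket keyed at 7200, averaging to about 22.95
```

The win is that the raw per-minute rows can be dropped (or compressed away) once the rollup exists, which is exactly the part a plain log file never does for you.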
There's also some query support, but that may be closer to something you can sort of do if you just have a log file.
I think the aggregations are the most direct value proposition. OP/author: worth making this pitch "above the fold" in the README, imho.
Also, I've done a lot of analytics work, and a fun feature to add that I've built in the past is an approximate median. I might open a PR and remind myself how to build that. Cheers!
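One simple way to get an approximate median in bounded memory (one option among several, not necessarily what a PR here would use) is a fixed-size uniform reservoir sample of the stream, taking the exact median of the sample:

```python
import random

class ApproxMedian:
    def __init__(self, capacity=1024, seed=0):
        self.capacity = capacity
        self.seen = 0
        self.sample = []
        self.rng = random.Random(seed)

    def add(self, x):
        # Classic reservoir sampling: each item survives with probability k/n.
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = x

    def median(self):
        s = sorted(self.sample)
        return s[len(s) // 2]

est = ApproxMedian()
for i in range(100_000):
    est.add(i)
print(est.median())  # close to 50,000, within sampling error
```

Memory stays fixed at `capacity` values regardless of stream length, which matters on the embedded targets this database aims at; streaming estimators like P² trade a little more code for even less memory.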
leetrout 20 hours ago [-]
Seems the IoT / embedded-device constraint is what is driving the query feature. You don't have to scan the whole file, and depending on where it is running, the rollup functionality could be a big help.
neoyagami 20 hours ago [-]
You can query it? And is it maybe faster?
aymanhs72 41 seconds ago [-]
I expect queries to be mostly about a single sensor or a few sensors over a date range. There is no index, but the files for live readings are per-day, and reading a single day's file is extremely fast. Rollups are also supported and act like normal databases with a configurable period, monthly or yearly; those can be queried pretty fast too.
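The "no index, but daily files" trade-off can be sketched like this: with one file per (sensor, day), a lookup opens exactly one small file instead of searching a big blob. File naming and the JSONL format here are illustrative, not the project's on-disk layout:

```python
import os, tempfile, json, datetime

root = tempfile.mkdtemp()

def path_for(sensor, day):
    # One file per sensor per calendar day.
    return os.path.join(root, f"{sensor}-{day.isoformat()}.jsonl")

def append(sensor, ts, value):
    day = datetime.date.fromtimestamp(ts)
    with open(path_for(sensor, day), "a") as f:
        f.write(json.dumps({"ts": ts, "v": value}) + "\n")

def lookup(sensor, ts):
    # Only the matching day's file is opened and scanned.
    day = datetime.date.fromtimestamp(ts)
    with open(path_for(sensor, day)) as f:
        for line in f:
            row = json.loads(line)
            if row["ts"] == ts:
                return row["v"]

append("temp1", 1_700_000_000, 21.5)
append("temp1", 1_700_000_010, 21.6)
print(lookup("temp1", 1_700_000_010))  # → 21.6
```

A date-range query then reduces to picking the handful of day files that overlap the range, which stays cheap even without any index.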
PunchyHamster 17 hours ago [-]
The history of every append-only database:
* we will make it append only: the type of data makes sense for it, and it will simplify the design
* whoops, devs fucked something up and added a bunch of nonsense that has to be removed; let's figure out how to make at least occasional deletes work
aymanhs72 4 minutes ago [-]
I already had that experience: some bad data and 85 °C readings getting into my temperature sensor data (85 °C just means the reading is invalid). So I do need to delete the bad data. The way it can be done now is to export, filter, and then import. I'll try to resist the temptation to fix this.
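The export-filter-import workaround amounts to a one-pass filter on the exported data. A minimal sketch, assuming an illustrative CSV export format (the sentinel check is the 85 °C power-on value the comment describes):

```python
import csv, io

# Stand-in for the exported data; the real export format may differ.
exported = io.StringIO(
    "ts,temp\n"
    "1700000000,21.5\n"
    "1700000010,85.0\n"   # invalid power-on sentinel reading
    "1700000020,21.6\n"
)
filtered = io.StringIO()

reader = csv.DictReader(exported)
writer = csv.DictWriter(filtered, fieldnames=["ts", "temp"])
writer.writeheader()
for row in reader:
    if float(row["temp"]) != 85.0:  # drop the invalid sentinel rows
        writer.writerow(row)

print(filtered.getvalue())  # the 85.0 row is gone; valid rows pass through
```

The filtered stream is then re-imported as a fresh append-only history, so the database itself never needs a delete path.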
decisionSniper 14 hours ago [-]
These match my experiences living with these in production.
> Another decent option might be Clickhouse. Sadly, as far as I know, DuckDB has no real understanding of sorted or ordered data, so it might be challenging to avoid absurd amounts of read amplification.
> You could certainly create a directory with a Parquet file for each (entity id, time range), and you could probably convince the DuckDB query engine to understand that (using Ducklake? raw Hive can only barely do this), but I don’t think that DuckDB will binary search for you. (And binary search is actually pretty lousy for this use case.)
Clickhouse has explicitly ordered tables:
https://clickhouse.com/docs/engines/table-engines/mergetree-...
https://duckdb.org/2022/07/27/art-storage