[Bug 49687] Zeitgeist based LogStore

Thu May 10 16:45:46 CEST 2012

https://bugs.freedesktop.org/show_bug.cgi?id=49687

--- Comment #9 from Seif Lotfy <seif at lotfy.com> 2012-05-10 07:45:46 PDT ---
(In reply to comment #7)
> (In reply to comment #6)
> > One of the main reasons I thought of splitting the DB and wrapping the TBI
> > (message DB) is to be able to full-text index the messages with an FTS
> > library such as Xapian or Lucene.
> 
> > Having the Events and the Messages in one DB just makes it not so clean in
> > terms of separation of concern. What if we decide that Xapian is not so good of
> > a FTS indexer. This would mean wrapping the whole DB all over again.
> 
> OK, can you expand this thought in a rationale on the wiki?
> What are the requirements for having the FTS indexer working?
> Actually, an FTS section mentioning Xapian or any other FTS indexer is missing.

I will update the wiki on FTS as soon as possible...

> What's the difference between having what you proposed (event_id+body table)
> and a table with more columns? Would a JOIN on two tables work?

Joining two tables works.
In theory we could have one table. In practice it would cost us a LOT of
space and resources.
Why?
Because, for example, storing the account path or the target_id as strings in
the table over and over would require a lot of space and would reduce speed.
We would need to map those strings to integer ids, which would live in another
table. Integers are easier to compare and don't need as much space as strings
in the DB.
Also, having the message strings together with the events in one table would
make event lookups slower than having an events table alone. This would
matter when we are trying to populate the "recently called" list, where we
are not interested in messages.

Splitting the log from the content of a message would allow us to perform much
more efficiently. Zeitgeist already maps the account path and such to integer
ids internally, which would take a lot of overhead work away from the log.

> > Anyhow we could create a dedicated integrated event DB. Which does its own 
> > sync read/write for the Log. We could do that with a our own integration or 
> > with the upcoming libzeitgeist2 which creates a DB for you without the daemon. 
> 
> I thought you already put a section about those possibilities on the wiki.
> Please, add a section about it as well, so we have a complete scenario.

Yeah, I need to find a way to express myself properly for the scenarios. But I
think the rest of my answers here will clarify a lot.

> 
> > In both cases if your worries are doing stuff async then switching to sync
> > writing should not be an issue.
> 
> I don't think that the async writing is what worries people, but what makes
> them (including me) critical.
> As long as there is a weak point in the proposal, it's not really viable.

OK, now I am lost. What is not viable?

> > > The bigger issue is that the second part (body) is the one most likely to be
> > > lost and it's also the most important (or at least equally important with some
> > > other event's info).
> > 
> > When can it get lost.
> 
> Lost = the daemon shuts down before the callback is fired (including ZG never
> called us back).
> How can it happen? Normal dbus service life cycle, desktop lifecycle, etc.
> This part we can work on.
> 
> Last but not least TPL crashes: the longer it takes (in term of steps, rather
> than time, I know ZG is fast) of storing the whole info, the higher the
> possibility of data loss on a crash. This one we cannot do much, but we need to
> make TPL arch less susceptible to inconsistencies on such situations as well.
> 
> > > I don't care if I don't remember the avatar used with the message or the
> > > geolocation of the event.
> > > I care if I don't have the body or I cannot associate it with timestamp or
> > > from/to.
> > > 
> > > Is there any way to invert the process?
> > > 
> > > 1- write the body with the minimum set of needed info into the Body Index (even
> > > if duplicated in the Log later), assigning a primary key X
> > > 2- write the Log, telling ZG that this event is related to X (or giving it our
> > > own event_id).
> > 
> > Sadly you can't give Zeitgeist an event_id to an event.
> > 
> > Well a good solution is to have a temp_table which stores all the info as it is
> > (strings) as soon as they arrive. When an interaction happens we will first
> > dump it in the temp_table. Then we insert into the log then into the TBI. Once
> > both insertions took place we remove from the temp_table. This way if TPL quits
> > or crashes, or the Log is not reachable in the middle of a process, the next
> > time TPL starts it will find the temp_table not empty and try to empty it.
> 
> This is a similar approach to what we use for pending messages, I think Nicolas
> was thinking of a similar thing on Comment #4
> 
> My idea is not considering the temp table temporary at all, but part of the
> log.
> You have already the data, why removing it?

I get your concern here. The answer is tricky. The temp_table saves events in a
simple, nearly raw format for easier lookup. Keeping that raw format as
permanent storage would cost us a lot of space, and it is not optimized. We
would need to optimize it for querying and such, which would end up being an
implementation of a whole new SQLite DB similar to what I just proposed.

> This also would make TPL queries (the log_manger_get_FOO()) not asking two
> places, but just one.

Well, yes, but at the cost of speed and space; see my explanation above.

> > You might
> > ask why not keep the temp_table as our main storage. Well:
> > 1) it is hard to do a FTS index around it.
> 
> I look forward to seeing it on the Wiki. Would it help to have two tables?
> One for the body and one for the rest (timestamp, id).

Not really. If we use SQLite's FTS then yes, it would. If we use Xapian or
Lucene, both need their own DB format, which is efficient for FTS but not for
normal querying. Also, FTS DBs use a LOT of space compared to a normal SQLite
DB. My ZG DB is 20 MB and my FTS DB for Zeitgeist (we also used the split
method) is 43 MB. And Zeitgeist does not really log big text, only headers,
mimetypes and such.

> > 2) it will have duplicate string entries for example the target string. Which
> > can be costly and should be rather stored as an int.
> 
> I don't understand what you mean. Is it a write() problem?
> We already write the data fully in the temp table.

The temp table, if not optimized, will be storing a lot of strings, which will
cost time in querying since looking up strings is slower than looking up
integers. We would need tables mapping property values to integers, which
would lead to more or less a log like Zeitgeist :D

> If this is an issue, it can be avoided by refactoring into multiple tables
> 
> | tpl_id | event_id | body | (table 1)
> | contact_id | contact_id_number (yeah, silly name) | (table 2)
> | tpl_id | timestamp | contact_id_number | (table 3)
> Table 2 is written when a new contact enters the log (this write() happens once
> in the table lifetime for each contact who will eventually contact us).
> An in memory cache (hashtable) can be used in the LogStore for the recently
> contacted people, so to avoid continuous queries (read) to table 2 as well.
> 
> This way we have a body and then only integers to deal with, on the average
> situation.

Tables 2 and 3 are more or less provided by the log internally. Table 1,
however, as explained before, would force us to use the SQLite FTS if we
decide the temp_table is not temporary at all, and would require a lot of
implementation work on our side to optimize.

If we keep the temp_table temporary, and entries are purged once they are
written to the actual log and the TBI, we will reduce the implementation
effort, and the queries will be efficient since the temp_table would ideally
only ever have one entry in it.

> The real problem is how to deal with data loss :)
> 
> > > This is a scenario in which we have a private DB for what we need and delegate
> > > to ZG all the extra data, rather to have the private DB to keep what ZG cannot
> > > store/is better not store in ZG.
> > 
> > I can't follow. Can you elaborate?
> 
> It's just a consideration on how ZG and the private DB are used.
> 
> WRT my former idea (a) of 
> 1- writing the whole needed info into SQLite
> 2- push the event info to ZG
> 
> and your idea (b) of
> 1- push the event info to ZG
> 2- writing the info that ZG does not store into SQLite
> 
> a) is a way to have a local SQLite DB with the majority of the info we need,
> and use ZG (from TPL) to get the rest of the info. If for any reason ZG is not
> running, we still can work.
> ZG has partially duplicated info, but it wouldn't be a big issue in my opinion.

Agreed, this wouldn't be an issue, but it would require a lot of
implementation work.

> in b) ZG has a main role, and the private SQLite is there only because ZG
> cannot do FTS for the moment.
> Delegating the whole log to ZG would be OK, the fact that it cannot handle the
> body index for the moment is what actually creating the problem (see callback),
> fixing that is probably the ideal solution.
> Although, we completely rely on ZG, which means that if ZG is down, we cannot
> do it. Can it be an issue?

Well, Zeitgeist could take over the whole TBI in the form of an extension. This
means Zeitgeist would have an extension, just like its FTS extension, that
hooks into "post_insert_event" and writes the body into a new table. This
would, however, move the DBus API into the Zeitgeist domain. :/

But I think I have a good idea. We could easily create an observer extension in
Zeitgeist that listens to Telepathy events and does the reading and writing
inside Zeitgeist. Or the LogStore could forward events to Zeitgeist (via the
exposed extension API). In both cases Zeitgeist would maintain the Log and
the TBI. Zeitgeist then replies with an event_id and the LogStore can remove
the event from its "temp_table".
Reading would be done directly.

So, to sum it up: if the LogStore is to push the events to Zeitgeist:
1) TPL tells Zeitgeist to log an event with the body, while keeping a copy
of the event and the body in a temp_table.
2) Zeitgeist does its own magic and returns an id for the event (which means
the event and the body were successfully stored).
3) TPL gets the id back (via callback) and removes the entry from the
temp_table.
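
The three steps above can be sketched roughly as follows. This is a
simplified, synchronous illustration and not TPL or Zeitgeist code: the
`push` argument is a hypothetical stand-in for whatever call hands the event
to Zeitgeist and returns its id, and the temp_table layout (a JSON payload
per row) is an assumption made for brevity.

```python
import json
import sqlite3


def open_pending(path=":memory:"):
    """Open the store holding the hypothetical temp_table of pending events."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS temp_table (
            id      INTEGER PRIMARY KEY,
            payload TEXT NOT NULL
        )""")
    conn.commit()
    return conn


def _flush_one(conn, pending_id, event, push):
    """Try to push one pending event; purge it only after the log accepted it."""
    try:
        event_id = push(event)  # hypothetical: returns the id assigned by the log
    except Exception:
        return None  # log unreachable: the entry stays in temp_table
    conn.execute("DELETE FROM temp_table WHERE id = ?", (pending_id,))
    conn.commit()
    return event_id


def log_interaction(conn, event, push):
    """Step 1: keep a raw copy first. Steps 2 and 3: push, then purge."""
    cur = conn.execute(
        "INSERT INTO temp_table (payload) VALUES (?)", (json.dumps(event),))
    conn.commit()
    return _flush_one(conn, cur.lastrowid, event, push)


def recover(conn, push):
    """On startup, retry everything a crash or an unreachable log left behind."""
    rows = conn.execute(
        "SELECT id, payload FROM temp_table ORDER BY id").fetchall()
    return [_flush_one(conn, pid, json.loads(payload), push)
            for pid, payload in rows]
```

Note that in this sketch a crash between the push and the delete would make
the same event be pushed again on the next start, so whoever receives the
events would have to tolerate duplicates or detect them.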

Otherwise, if Zeitgeist has an extension with an observer:
1) The extension in Zeitgeist listens to events from Telepathy.
2) When an event occurs, Zeitgeist does its own magic, like it does with its
current FTS index, maintaining 2 DBs.

In both cases deleting stuff will be handled internally in Zeitgeist by the
extension.

To read, TPL would have direct access to the DB of Zeitgeist using
libzeitgeist2 and would extract the info it needs.
