I was just thinking about some simple sanity checks and constraints
that would prevent certain misuses.
Against out-of-bounds errors:
- the URLs must be well-formed
- the values and sizes of the information itself must stay within
  certain limits, e.g. not more than 99 tracks per disc, artist and
  track names not longer than a given number of characters (see the
  sketch below)
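A minimal sketch of such checks, assuming the limits mentioned above
(99 tracks, names capped at 255 characters - the exact numbers and the
field names are just placeholders for illustration):

    # Sketch of server-side sanity checks; the concrete limits are assumptions.
    MAX_TRACKS      = 99
    MAX_NAME_LENGTH = 255

    def validate_submission(artist, title, tracks):
        """Reject a submission that violates the basic size constraints."""
        if not artist or len(artist) > MAX_NAME_LENGTH:
            return False, "artist name missing or too long"
        if not title or len(title) > MAX_NAME_LENGTH:
            return False, "disc title missing or too long"
        if not 1 <= len(tracks) <= MAX_TRACKS:
            return False, "track count out of range"
        for name in tracks:
            if not name or len(name) > MAX_NAME_LENGTH:
                return False, "track name missing or too long"
        return True, "ok"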
Against resource exhaustion:
- Limits on how much time a C/S interaction may take (depending on
  server load), either actively by some sort of watchdog daemon that
  terminates requests taking too long (as sketched below), or in
  advance if it is possible to estimate that time roughly
- Access limits on requests/submissions.
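One possible form of such a watchdog, assuming a Unix server (SIGALRM)
and a time budget of 30 seconds per request - both are assumptions, and
the budget could be tightened when server load is high:

    import signal

    class RequestTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise RequestTimeout()

    def run_with_deadline(handler, seconds=30):
        """Run one request handler, aborting it if it exceeds its budget."""
        signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(seconds)
        try:
            return handler()
        finally:
            signal.alarm(0)   # always clear the pending alarm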
I would try to identify scenarios of "typical" uses and
formulate boundary conditions for them.
Let's say:
- lookup by using the client or some enabled CD player,
  worst case that might still be legitimate:
  flipping through a lot of entries, e.g. limit to 500 requests
  per hour per IP
- submissions by using the client or some enabled CD player,
  worst case:
  a motivated user able to type in CD info in 5 minutes ->
  limit to 12 submissions per hour per IP (see the sketch after
  this list)
- syncing a node (be it another online site, or some user who
  wants all or a certain part of the data on his local machine):
  this should be done in a different mode, a bulk mode.
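For the first two scenarios, the per-IP limits could be enforced with a
simple sliding-window counter; the figures below are the example values
from above, not measured ones:

    import time
    from collections import defaultdict, deque

    LIMITS = {"lookup": 500, "submission": 12}   # allowed per hour per IP
    WINDOW = 3600                                # seconds

    _history = defaultdict(deque)                # (ip, kind) -> request times

    def allow(ip, kind, now=None):
        """Return True if this request still fits within the hourly budget."""
        now = time.time() if now is None else now
        q = _history[(ip, kind)]
        while q and now - q[0] > WINDOW:
            q.popleft()                          # forget entries older than an hour
        if len(q) >= LIMITS[kind]:
            return False
        q.append(now)
        return True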
So everything bigger than these single interactions has to be done
in lumps.
You want to submit 1000 entries at once - fine with us, but
you have to supply them grouped together, in one big submission
package - not by doing 1000 more or less stateless single
requests.
You send us 1000 junk entries? No problem: we know in advance
whether they will fit on our hard disk, and we can review them
and remove them later with one command if we decide to reject them.
And we could tell you to retry such a request later, in an hour
when server load is lower, or to submit a smaller package.
You need 5000 entries right now? OK, we prepare a package
for you and transmit it to you in a way that keeps the server
accessible for single requests.
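A rough sketch of what such a submission package and its up-front check
could look like; the JSON serialization, the field names and the 10 MB
cap are all assumptions made only for the example (an ASCII/XML export
format, as discussed further down, would do just as well):

    import gzip, json

    def build_package(entries):
        """Client side: pack many entries into one compressed blob."""
        payload = json.dumps({"count": len(entries), "entries": entries})
        return gzip.compress(payload.encode("utf-8"))

    def inspect_package(blob, max_bytes=10 * 1024 * 1024):
        """Server side: decide up front whether the package fits at all."""
        if len(blob) > max_bytes:
            return None, "package too large, please retry later or send less"
        data = json.loads(gzip.decompress(blob).decode("utf-8"))
        return data["entries"], "ok"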
> The fact that we want a completely decentralized
> and automated system makes it hard to prevent people from maliciously
> tampering with the system. In essence, every system I conjured up is
> vulnerable to attacks.
Like I said above, I see legitimate access to the system divided (at
least) into two categories:
a) single requests (only a limited number of interactions per hour),
   caused by CD player lookups and manual input of new CDs
b) bulk-mode, caused by the syncing of nodes - many items, but
localized! - could be better compressed, could be given a
hash for reference among nodes etc.
This way the number of interactions per hour that we have to
administer or check between nodes cannot grow beyond a given size -
a size that the system can handle.
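Giving each bulk package a content hash would let nodes reference it
and skip transfers they already have; SHA-256 here is just one possible
choice:

    import hashlib

    def package_digest(blob):
        """Stable identifier for a compressed submission package."""
        return hashlib.sha256(blob).hexdigest()

    def need_transfer(known_digests, remote_digest):
        """A node only fetches packages whose digest it does not know yet."""
        return remote_digest not in known_digests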
> We may also want to build in some journaling/back-out features that
> allow a CD Index administrator to say, "back out the last three days of
> changes" if someone started tampering with the data 3 days ago. Yes,
> work will get lost, but if there are enough servers in the system that
> shouldn't be noticeable.
Just to get a working system fast, why not map it onto the
CVS infrastructure?
If we could export the contents of the database in a suitable fashion,
we could use CVS to keep these contents plus their revision history,
and we could use CVSup
http://www.polstra.com/projects/freeware/CVSup
for effective synchronization of CVS trees over the net, between nodes.
So if the export format is an ASCII format, like the XML stuff, it
can be handled by diff and patch and thus by CVS.
It does not have to be a single file. It could be split by time
(like logfiles), by length (blocks of N entries), or by some other
scheme (artists a..z) - see the sketch below.
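A sketch of such a split export, using blocks of N entries; the block
size and the file naming are assumptions:

    import os

    BLOCK_SIZE = 1000

    def export_blocks(entries, outdir):
        """Write entries as numbered block files (block-0000.txt, ...),
        so that diff/patch - and therefore CVS/CVSup - only ever touch
        the files that actually changed."""
        os.makedirs(outdir, exist_ok=True)
        for i in range(0, len(entries), BLOCK_SIZE):
            path = os.path.join(outdir, "block-%04d.txt" % (i // BLOCK_SIZE))
            with open(path, "w", encoding="utf-8") as f:
                for entry in entries[i:i + BLOCK_SIZE]:
                    f.write(entry + "\n")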
> Am I off in the deep end here?
No, the construction of an open distributed CD Index is the most
challenging part of this project!
Genuine Internet stuff.
Lots of interesting subjects to learn and experiment with.
Regards,
Marc