CDIndex future & DTD

pouwelse (pouwelse nospam at twi.tudelft.nl)
Sun, 05 Dec 1999 14:36:48 +0100

This is a multi-part message in MIME format.
--------------EE9BDFFEA9A0129E76C2B672
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Hello,

For the future of CDIndex I think these points are important:
1.generic implementation
2.extendable data format.

1.generic
This week I finished a paper for the USENIX conference about publicly
writable databases such as CDIndex.
The paper describes a generic implementation of a publicly writable
database. Publicly writable databases such as cdindex.org, slashdot.org,
and imdb.org are each built for one specific kind of content. I tried to
describe a generic system in which the content is separated from the
tools.
With this separation it is possible to reuse the software and create
databases for all sorts of information.
The three-page HTML version can be found at:
http://www.pds.twi.tudelft.nl/~pouwelse/OpenInformationPools.html
For now we need to concentrate on a working dedicated implementation
rather than a fully generic one, but I hope we can keep this issue in
mind when important design decisions are made.
Do more people think a generic solution should be the ultimate goal,
with a good CDDB replacement as the first target?

2.extendable data format
If we want to build software and freeze the DTD, it is important that
the DTD can be extended with more information fields.
The current DTD is correct, yet it needs to be updated. All data is put
inside the <CDInfo> tag. In databases, relations are 'normalized' and
1:N or N:M relations are handled specially. Should we modify the DTD to
take into account the 1:N relations between artists and CDs and between
artists and tracks?
If we want to extend the artist information with date of birth,
biography, web site, fan club site, picture, email, etc., we run into
problems with the current DTD. In MultipleArtistCD and SingleArtistCD
the artist name is repeated. With more information fields for the
artist, the XML will contain a lot of redundancy. When artists get
their own tags outside <CDInfo> and are referred to by reference, the
redundancy is removed and the relation is normalized.
I propose to use the ID and IDREF tokenized attribute types of the XML
1.0 standard as a reference to an artist from a CD. The enhanced DTD is
attached to this email.
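
To make the idea concrete, here is a minimal Perl sketch of what the
normalized form could look like once parsed. It is only an illustration:
the sample artists, the WebSite field, and the artist_name helper are
made up for this example and are not part of the attached code. Artists
are stored once under their Artist_id (the ID), and each CD carries only
that id (the IDREF).

```perl
use strict;
use warnings;

# Hypothetical sample data: each artist is stored once, keyed by the
# Artist_id that the DTD declares as an XML ID.
my %artists = (
    a1 => { Name => 'Artist One', WebSite => 'http://example.org/one' },
    a2 => { Name => 'Artist Two' },
);

# CDs carry only the Artist_id (the XML IDREF), not a copy of the
# artist data, so extra artist fields are never repeated per CD.
my @cds = (
    { Title => 'First Album',  Artist_id => 'a1', Tracks => ['Intro', 'Song'] },
    { Title => 'Second Album', Artist_id => 'a1', Tracks => ['Another Song'] },
);

# Resolve an IDREF to the artist name; returns undef when the
# reference does not point at a known artist.
sub artist_name {
    my ($artist_id) = @_;
    my $artist = $artists{$artist_id};
    return defined $artist ? $artist->{Name} : undef;
}

for my $cd (@cds) {
    my $name = artist_name($cd->{Artist_id});
    $name = '(unknown artist)' unless defined $name;
    print "$cd->{Title} by $name\n";
}
```

Adding a date of birth or biography then means adding one field to the
artist record, without touching any CD entry.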

A problem with the current XML parser is that it cannot handle
'context'. It sees no difference between a <Name> tag inside a
<Track> and one inside an <Artist> tag.
In other words, in the current DTD the <Name> tag is reserved for track
names and cannot be reused inside the <Title> or <Artist> tag.
With the attached source code I tried to make the parser more general;
I hope this 'context' principle can be included in the source tree. The
context support functions are located in the parse_con.pl file.
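
As a rough sketch of the 'context' principle, independent of the
attached files: keep a stack of the currently open elements and compare
it against a full path before deciding what the character data means.
The names start_element, end_element and in_context are hypothetical,
and the path comparison via join is a simplification of my own, not the
equal_context routine in parse_con.pl.

```perl
use strict;
use warnings;

# Stack of currently open elements, outermost first.
my @stack;

# Call on every opening tag.
sub start_element {
    my ($name) = @_;
    push @stack, $name;
}

# Call on every closing tag.
sub end_element {
    pop @stack;
}

# Returns 1 when the open elements match the given path exactly,
# 0 otherwise.
sub in_context {
    my (@path) = @_;
    return join('/', @stack) eq join('/', @path) ? 1 : 0;
}

# Simulate the parser having opened <CDInfo><SingleArtistCD><Track><Name>.
start_element($_) for qw(CDInfo SingleArtistCD Track Name);

# The same <Name> tag now means different things in different places.
print "track name\n"  if in_context(qw(CDInfo SingleArtistCD Track Name));
print "artist name\n" if in_context(qw(CDInfo SingleArtistCD Artist Name));

end_element();   # </Name>
```

With this in place the <Name> tag can be reused under <Artist> without
the handlers confusing it with a track name.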

p.s.
The lyrics/subtitling support is going OK. Inside the <Track> I added
the <Text> tag for lyrics and the <TimeStamp> tag to support
karaoke/music subtitles. After the parser is debugged I hope this can be
added to the server.

Greetings,
Johan.
--------------EE9BDFFEA9A0129E76C2B672
Content-Type: text/plain; charset=us-ascii;
name="CDInfo2.dtd"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="CDInfo2.dtd"

<!ELEMENT Artist (Name)>
<!ELEMENT CDInfo (Title, NumTracks, IdInfo?,
(SingleArtistCD | MultipleArtistCD))>
<!ELEMENT IdInfo (DiskId+)>
<!ELEMENT DiskId (Id?, TOC?)>
<!ELEMENT TOC (Offset+)>
<!ELEMENT SingleArtistCD (SingleArtistTrack+)>
<!ELEMENT SingleArtistTrack (Name)>
<!ELEMENT MultipleArtistCD (MultipleArtistTrack+)>
<!ELEMENT MultipleArtistTrack (Name)>
<!ELEMENT Id (#PCDATA)>
<!ELEMENT Title (#PCDATA)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Offset (#PCDATA)>
<!ELEMENT NumTracks (#PCDATA)>
<!ATTLIST Artist Artist_id ID #REQUIRED>
<!ATTLIST SingleArtistCD Artist_id IDREF #REQUIRED>
<!ATTLIST MultipleArtistTrack Artist_id IDREF #REQUIRED>
<!ATTLIST Offset Num NMTOKEN #REQUIRED>
<!ATTLIST Track Num NMTOKEN #REQUIRED
Duration NMTOKEN #IMPLIED>
<!ATTLIST TOC First NMTOKEN #REQUIRED
Last NMTOKEN #REQUIRED>

--------------EE9BDFFEA9A0129E76C2B672
Content-Type: application/x-perl;
name="parse_sup.pl"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="parse_sup.pl"

#!/usr/bin/perl -w
#____________________________________________________________________________
#
# CD Index - The Internet CD Index
#
# Copyright (C) 1998 Robert Kaye, Johan Pouwelse
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
# $Id: parse_sup.pl,v 1.1 1999/04/29 00:09:13 robert Exp $
#____________________________________________________________________________

use strict;
use XML::Parser;

require "db_import.pl";

#include the parse context support subs
require "parse_con.pl";

#variables used to store the CD information
my $CDInfo;
my $IdInfo;
my $NumIds;
my $DiskId;
my $SingleArtist;
my $MultipleArtist;
my $TrackInfo;
my $TextInfo;

#parsing variables
my $CurrentOffset;
my $CurrentTrack;
my $IsSingleArtist;
my @Context;
my $RetVal;
my $dbh;

sub SetDbh
{
$dbh = shift;
}

sub ParseInit
{
my ($Expat) = @_;
my %Temp1;
my %Temp2;
$CDInfo = \%Temp1;
$TextInfo = \%Temp2;

$RetVal = '';
}

sub ParseFinal
{
my ($Expat) = @_;

return $RetVal;
}

sub ParseStart
{
my ($Expat, $Element, %Attr) = @_;
my $key;

push @Context, $Element;

if ($Element eq 'Id' || $Element eq 'Title' || $Element eq 'NumTracks' ||
$Element eq 'Name' || $Element eq 'Artist' || $Element eq 'CDInfoDump' ||
$Element eq 'CDInfo' || $Element eq '')
{
return;
}
if ($Element eq 'IdInfo')
{
my %Temp;

$IdInfo = \%Temp;
$NumIds = 0;
return;
}
if ($Element eq 'DiskId')
{
my %Temp;

$DiskId = \%Temp;
return;
}
if ($Element eq 'TOC')
{
$DiskId->{First} = $Attr{First};
$DiskId->{Last} = $Attr{Last};
return;
}
if ($Element eq 'Offset')
{
$CurrentOffset = $Attr{Num};
return;
}
if ($Element eq 'SingleArtistCD')
{
my %Temp;

$SingleArtist = \%Temp;
$IsSingleArtist = 1;
return;
}
if ($Element eq 'MultipleArtistCD')
{
my %Temp;

$MultipleArtist = \%Temp;
$IsSingleArtist = 0;
return;
}
if ($Element eq 'Track')
{
my %Temp;

$TrackInfo = \%Temp;
$CurrentTrack = $Attr{Num};
return;
}
if ($Element eq 'Text')
{
return;
}

$RetVal .= "Parse error: Unrecognized element $Element\n";
}

sub ParseEnd
{
my ($Expat, $Element) = @_;

remove_context($Element, \@Context);

if ($Element eq 'Id' || $Element eq 'Offset' || $Element eq 'Artist' ||
$Element eq 'TOC' || $Element eq 'Name' || $Element eq 'Title' ||
$Element eq 'NumTracks' || $Element eq 'CDInfoDump' || $Element eq 'Text')
{
return;
}
if ($Element eq 'DiskId')
{
$IdInfo->{$NumIds} = $DiskId;
$NumIds++;
return;
}
if ($Element eq 'IdInfo')
{
$CDInfo->{IdInfo} = $IdInfo;
return;
}
if ($Element eq 'SingleArtistCD')
{
$CDInfo->{SingleArtist} = $SingleArtist;
return;
}
if ($Element eq 'MultipleArtistCD')
{
$CDInfo->{MultipleArtist} = $MultipleArtist;
return;
}
if ($Element eq 'Track')
{
if (!$IsSingleArtist)
{
$MultipleArtist->{$CurrentTrack} = $TrackInfo;
}
return;
}
if ($Element eq 'CDInfo')
{
#save the associative array with the text information per track
#into the main CDInfo
$CDInfo->{TextInfo} = $TextInfo;
#print "TextInfo added.\n";

if ($RetVal eq '')
{
my $ret;

if ($IsSingleArtist)
{
$ret = AcceptSingleArtistXML($CDInfo, $dbh);
}
else
{
$ret = AcceptMultipleArtistXML($CDInfo, $dbh);
}
if ($ret ne '')
{
$RetVal .= "DB Error: $ret\n";
}
}
return;
}

$RetVal .= "Parse error: Unrecognized close element $Element\n";
}

sub ParseChar
{
my ($Expat, $Char) = @_;
my $Dummy;

#print "Context: @Context. ";

if (equal_context(@Context, 'CDInfo', 'Title'))
{
$CDInfo->{Title} .= $Char;
return;
}
if (equal_context(@Context, 'CDInfo', 'NumTracks'))
{
$CDInfo->{NumTracks} = $Char;
return;
}
if (equal_context(@Context, 'CDInfo', 'IdInfo', 'DiskId', 'Id'))
{
$DiskId->{Id} .= $Char;
return;
}
if (equal_context(@Context, 'CDInfo', 'IdInfo', 'DiskId', 'TOC', 'Offset'))
{
$DiskId->{$CurrentOffset} = $Char;
return;
}
if (equal_context(@Context, 'CDInfo', 'SingleArtistCD', 'Artist'))
{
$SingleArtist->{Artist} .= $Char;
return;
}
if (equal_context(@Context, 'CDInfo', 'MultipleArtistCD', 'Track', 'Artist'))
{
$TrackInfo->{Artist} .= $Char;
return;
}
if (equal_context(@Context, 'CDInfo', 'SingleArtistCD', 'Track', 'Name'))
{
$SingleArtist->{$CurrentTrack} .= $Char;
return;
}
if (equal_context(@Context, 'CDInfo', 'MultipleArtistCD', 'Track', 'Name'))
{
$TrackInfo->{Name} .= $Char;
return;
}
#JOHAN added the text tag
if (equal_context(@Context, 'CDInfo', 'SingleArtistCD', 'Track', 'Text') ||
equal_context(@Context, 'CDInfo', 'MultipleArtistCD', 'Track', 'Text'))
{
$TextInfo->{$CurrentTrack} .= $Char;
return;
}

$Dummy = $Char;
$Dummy =~ tr/ \n\r\t//ds;
if ($Dummy ne '')
{
$RetVal = "Parse Error: Extra character data '$Char'\n";
}
}

1;

--------------EE9BDFFEA9A0129E76C2B672
Content-Type: application/x-perl;
name="parse_con.pl"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="parse_con.pl"

#!/usr/bin/perl -w
#____________________________________________________________________________
#
# CD Index - The Internet CD Index
#
# Copyright (C) 1999 Johan Pouwelse
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
#____________________________________________________________________________

use strict;

#The function returns 1 if the argument list consists of two equal
#halves, i.e. the same sequence of strings twice in a row
# equal_context(("CDInfo","Title"),("CDInfo","Title")) == 1
# equal_context(("CDInfo","Title"),("CDInfo","Title","name")) == 0
sub equal_context
{
my (@Context) = @_;
my $tel;

# Only an even number of parameters can form two equal halves
if (($#_+1) % 2 == 0) {
#print "num $#Context\n";
# test every context tag against its counterpart in the second half
for ($tel=0;$tel < ($#_+1)/2;$tel++)
{
if ($Context[$tel] ne $Context[$tel + ($#_+1)/2])
{
return 0;
}
}
} else {
return 0;
}
return 1;
}

#The function removes the tag Element from the context stack in
#Context_ref, together with every tag pushed after it.
#It returns the number of tags removed, or 0 when Element is not
#found in Context_ref.
sub remove_context
{
my ($Element, $Context_ref) = @_;
my $found = 0;
my $removed;
my $tag;

# count Element and every tag that follows it
foreach $tag (@$Context_ref)
{
if ($Element eq $tag || $found)
{
$found++;
}
}
# remember the count before the loop below runs it down to 0
$removed = $found;
for (;$found>0;$found--)
{
pop @$Context_ref;
}
return $removed;
}

1;

--------------EE9BDFFEA9A0129E76C2B672--