
ODP Data Dumps

General Description



The Open Directory Project distributes its data in an RDF-like XML format. These data files are generated on a regular basis and may be downloaded from this directory.

Errata



Unfortunately, the data dump format used by the Open Directory Project is not optimal. In particular it is much longer than it needs to be, and is not in standard RDF format.



Because of this, the format will need to change from time to time. We will note the changes in this file until we have a better method for notifying users of the changes.

Changes



2004-04-26

In addition to link tags, we now have pdf, rss, and atom tags for non-html content. All RSS/Atom feeds are validated before we add them to the directory. In the next few weeks, you'll see a separate rss.rdf.u8.gz file that will contain only categories and rss/atom feeds.



2003-05-29

ExternalPage objects now contain a topic tag that reflects the category this listing belongs in.
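As a rough illustration of reading the topic tag, the sketch below streams through a dump and yields each listing with its category. The ExternalPage and topic names come from the note above; the assumption that the listing URL sits in an attribute named about, and the function name iter_page_topics, are illustrative and not confirmed by this file.

```python
import xml.etree.ElementTree as ET


def local(name):
    """Strip a '{namespace}' prefix, if any, from a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def iter_page_topics(stream):
    """Yield (about, topic) for every ExternalPage element in the stream.

    Assumption: the listing URL is stored in an attribute called 'about'.
    Either value is None when it is missing.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "ExternalPage":
            continue
        about = next((v for k, v in elem.attrib.items() if local(k) == "about"), None)
        topic = next((c.text for c in elem if local(c.tag) == "topic"), None)
        yield about, topic
        elem.clear()  # release the finished element to keep memory use low
```

The stream can be any binary file object, for example one returned by gzip.open on a downloaded dump file.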



2003-05-20

A few new tags have been added to the RDF files. There is a tag dispname which gives the "display name" for each category. In addition, several new AOL-specific tags have been added: aolsearch, aolshopping, aolkeyword, aoltopictype, and aolcattype.



2003-03-12

More error-checking and filtering on the editor data entry forms have been added to prevent UTF-8 and XML character encoding problems. The data dumps should now contain no illegal UTF-8 byte sequences, no illegal XML characters, and only well-formed XML. Each data dump is now tested for errors. The error report contains specific results for each of these three types of errors.
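The three kinds of error named above can be checked on the consumer side as well. The following is a minimal sketch, not the checker used to produce the error report: the function name check_dump and the report keys are hypothetical, and it reads the whole input into memory, so it suits samples rather than a full dump.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production counts as an illegal character.
_ILLEGAL_XML = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def check_dump(data: bytes) -> dict:
    """Report illegal UTF-8, illegal XML characters, and well-formedness."""
    report = {"bad_utf8": False, "illegal_xml_chars": 0, "well_formed": True}

    # 1. Illegal UTF-8 byte sequences.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        report["bad_utf8"] = True
        text = data.decode("utf-8", errors="replace")

    # 2. Characters that XML 1.0 does not allow.
    report["illegal_xml_chars"] = len(_ILLEGAL_XML.findall(text))

    # 3. Well-formedness, as judged by the standard library parser.
    try:
        ET.fromstring(data)
    except ET.ParseError:
        report["well_formed"] = False

    return report
```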



2002-07-23

Data in the Netscape/ tree is no longer included in the main RDF dump. Instead, it is provided in these files:

netscape-content.rdf.u8.gz
netscape-structure.rdf.u8.gz
netscape-terms.rdf.u8.gz

2000-11-20

A few additions have been made to the RDF files. There is a tag for altlang categories which behaves exactly like the symbolic tags. Tags have also been added for mediadate and ages (for Kids_and_Teens); they appear on a URL only when that URL has these qualities.



1999-12-09

The RDF files are going UTF-8. I hope that this will clear up a lot of the problems that some users have been having with the format. If you notice any problems, please send mail to truel@dmoz.org.

We will continue to generate the current RDF files until at least January 8, 2000. We will be generating UTF-8 files periodically until that date. After January 9, all rdf files will be in the UTF-8 character set.

N.B. Some languages may have some incorrect characters. More precisely some of our categories do not have a character set associated with them yet, and so I am converting them to UTF-8 as though they were encoded in ISO-8859-1. Please do not send me email if you think you know what character set a given language should be in, but only if you know what character set the given ODP category is in.



1999-08-25

I have created an eGroups.com mailing list, odp-rdf-announce, to announce changes to the RDF format. The group, which has a subscription form and an archive, is hosted by Yahoo! Groups.


1999-08-24

We now provide redirect.rdf.u8.gz, which lists categories that have been moved and where they have been moved to. This should obviate your need for the catmv.log.gz file.

Redirections here are pre-chained. That is, if a category has moved several times, the redirection listed is the first one that actually hits a category. If someone moves a directory around and someone else creates a directory at one of the intermediate locations, the newcomer is the redirection listed.
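To make the pre-chaining rule concrete, here is a small sketch of how a chain of moves could be collapsed. It is an illustration of the rule as described, not the code that generates redirect.rdf.u8.gz; the function name prechain, the moves mapping, and the existing set are all hypothetical.

```python
def prechain(moves, existing):
    """Collapse chains of category moves.

    moves:    dict mapping an old category path to the path it was moved to
    existing: set of category paths that currently exist

    For each moved category, follow the chain until it reaches a path that
    exists (the first one that "actually hits a category") or the chain ends.
    """
    resolved = {}
    for start in moves:
        seen = {start}
        target = moves[start]
        # Keep following moves while the target is gone and was itself moved.
        while target not in existing and target in moves and target not in seen:
            seen.add(target)
            target = moves[target]
        resolved[start] = target
    return resolved
```

With moves from Top/A to Top/B and from Top/B to Top/C, Top/A resolves to Top/C when only Top/C exists. If someone has since created a new Top/B, Top/A resolves to that newcomer instead.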



1999-07-29

Character escaping is being done inside all fields now, not just in Titles and Descriptions. The following four characters are being quoted, so you will have to unquote them when converting to html:

& becomes &amp;
< becomes &lt;
> becomes &gt;
" becomes &quot;

High-byte characters and non-printing control characters are also being quoted now. I have decided against utilizing actual character quoting (i.e. &#21ae;) since supporting full Unicode is beyond the capabilities of some of our customers. Instead the hex value of these characters will be presented, and if you wish to convert to Unicode, you will have to keep track of the charset for the given category.

As an example, the byte value 200 will be presented as &xC8; whether that character came from the 8859-1 character set (È or &#x00C8;), from 8859-2 (Č or &#x010C;), or from any other character set.
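A consumer of this format has to undo both kinds of quoting. The sketch below is one possible approach under stated assumptions: the function name decode_field is hypothetical, the caller is assumed to know the charset of the category, and hex digits are accepted in either case. It decodes everything in a single pass so that a quoted ampersand followed by literal text is not mistaken for a byte escape.

```python
import re

_NAMED = {"amp": "&", "lt": "<", "gt": ">", "quot": '"'}

# Either a run of &xHH; byte escapes or one of the four quoted characters.
_TOKEN = re.compile(r"((?:&x[0-9A-Fa-f]{2};)+)|&(amp|lt|gt|quot);")
_HEX = re.compile(r"&x([0-9A-Fa-f]{2});")


def decode_field(text, charset="iso-8859-1"):
    """Unquote a field, decoding &xHH; bytes with the category's charset."""

    def repl(match):
        run = match.group(1)
        if run:
            # Collect the whole run first so multi-byte charsets decode correctly.
            raw = bytes(int(h, 16) for h in _HEX.findall(run))
            return raw.decode(charset, errors="replace")
        return _NAMED[match.group(2)]

    return _TOKEN.sub(repl, text)
```

For instance, decode_field("&xC8;", "iso-8859-1") gives È, while the same input with "iso-8859-2" gives Č.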



1999-05-18

Symbolic links that have been separated from the rest of the subcategories now have the link type "<symbolic1 ...>". This is exactly analogous to "<narrow1 ...>" (for separated subcategories).

TheSnowman
