Difference between revisions of "Nintendo - Nintendo DSi (Digital) (CDN) dat notes"

Revision as of 14:59, 17 November 2020

Background Information

--THIS IS A DRAFT FOR THE SOON-TO-BE-RELEASED DAT--

The Nintendo - Nintendo DSi (Digital) (New) dat is the work of Hiccup and zedseven, with additional work by Larsenv and data sourced from Galaxy. It was created by a tool written specifically for this purpose, NUS Ripper.

The dat includes all known CDN files for DSiWare, including:

  • Metadata files (tmd)
  • Tickets (cetk)
  • Encrypted content
  • Decrypted content (verified by hash with hashes contained in metadata)
  • DSi Shop data (FFFD0000 and up):
    • Icons
    • Screenshots
    • Zipped HTML manuals for system titles
  • All content ID 'versions' of above files:
    • FFFEFFFF and down - same as metadata
    • FFFFFFFD - same as ticket, if available
    • FFFFFFFE - 0-byte file that always exists for every title


Field Sources

This is a breakdown of how each field in the dat is sourced, with relevant information if necessary.

The format for the following section is:

tag

  • field: Explanation & source.


game

  • name: The No-Intro title of the ROM.

archive

  • name: Same as game name.
  • number: The archive number for the dat. Used primarily for parent-clone relationships.
  • region: Determined from the last character of the game code (the last two hex digits of the title ID).
    • See #Region Code to No-Intro Region Map for the full table used.
  • languages: Determined by the multi-step method detailed in #Language Determination Method.
  • special1: Set to "System" if the first character of the game code is an 'H' ('K' is for normal games, 'H' is for system titles). Otherwise, it's left blank.
  • special2: Set to "Removed" if the title had been removed from Nintendo's CDN by the time of dat creation.
  • clone: Set to 'P' if the title is considered the parent of its regional releases; otherwise it's set to the archive number of the parent title. The parent is chosen by sorting the regional releases: first by whether their supported languages include En, then by how many languages they support, and finally by a fixed region order designed to prioritize regions with greater global/cultural 'coverage'. The first release in the sorted list becomes the parent. This is designed to pick the release with the most canonical 'coverage', spanning the most languages. The presence of English is a factor simply because No-Intro is generally English-speaking, so it makes sense for 1G1R set collectors.
    • See #Parent Determination Region Order for the region order used to break ties.
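
The parent selection described for the clone field can be sketched roughly as follows. This is only an illustration in Python, not the code NUS Ripper actually uses; the Release class and the function names are made up for the example, and the region order string is copied from the #Parent Determination Region Order table.

```python
from dataclasses import dataclass

# Region codes in tie-break order, copied from the
# "Parent Determination Region Order" table (A first, C last).
REGION_ORDER = "AVPOTEUXFDSIHJKC"


@dataclass
class Release:
    """Hypothetical stand-in for one regional release of a title."""
    number: str        # archive number
    region_code: str   # last character of the game code
    languages: list    # supported languages, e.g. ["En", "Fr"]


def sort_key(release):
    """Smaller keys sort first, so the best parent candidate has the smallest key."""
    return (
        "En" not in release.languages,             # releases containing En come first
        -len(release.languages),                   # then the most languages
        REGION_ORDER.index(release.region_code),   # then the region order
    )


def assign_clone_fields(releases):
    """Map each archive number to 'P' (parent) or to the parent's archive number."""
    parent = min(releases, key=sort_key)
    return {r.number: ("P" if r is parent else parent.number) for r in releases}
```

With a USA release supporting only En, a Japanese release supporting only Ja, and a European release supporting En, Fr and De, the European release would be marked 'P' and the other two would point at its archive number.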

source

  • Each title has multiple source tags. This is because files are grouped by dump date as well as by dumper. This allows the dat to fully capture when each file was dumped and datted, and more importantly, captures the work done by each person in the project.

details

  • dumpdate: The date this group of files was downloaded and/or decrypted. It applies to every child rom tag.
  • knowndumpdate: This is set to '1' for every entry by both zedseven and Galaxy, but the dump dates for Larsen's entries aren't known for sure, and as such his are set to '0'.
  • releasedate: Set to the same value as dumpdate. The actual release date isn't known, so this value acts as a default.
  • knownreleasedate: Set to '0'.
  • dumper: Set to the username of the person that source tag is for. The entries with dumper="zedseven" contain the most data, but it was important to include the work of Galaxy and Larsen, since their work helped in the creation of the dat.
  • tool: Set to "NUS Ripper vX.X.X" in the case of dumper="zedseven", and "Custom" otherwise.
  • origin: Always set to "CDN", since the source for all of the information is Nintendo's CDN.

serials

  • digitalserial1: The title ID of the title. (16 digit hex number)
  • digitalserial2: The game code of the title, calculated from the last 8 digits of the title ID (each two digits are an ASCII character). The game code is what would be printed on the back of a cartridge - a 4-character code consisting of letters and numbers, that contains information on the type of title it is, and what region the release is for.
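
As a rough illustration of how the game code, region and special1 fields follow from a title ID, here is a small Python sketch. It is not taken from NUS Ripper; the function names are invented for the example, the region map is copied from the #Region Code to No-Intro Region Map table, and the title IDs in the usage note are made-up values that have not been checked against real titles.

```python
# Region codes copied from the "Region Code to No-Intro Region Map" table.
REGION_MAP = {
    "E": "USA", "J": "Japan", "P": "Europe", "U": "Australia",
    "K": "Korea", "V": "Europe, Australia", "C": "China", "D": "Germany",
    "F": "France", "I": "Italy", "S": "Spain", "O": "USA, Europe",
    "X": "Europe", "T": "USA, Australia", "H": "Netherlands", "A": "World",
}


def game_code_from_title_id(title_id):
    """Decode the last 8 hex digits of a 16-digit title ID as 4 ASCII characters."""
    return bytes.fromhex(title_id[-8:]).decode("ascii")


def describe_title(title_id):
    """Derive the serial, region and special1 values described in this section."""
    code = game_code_from_title_id(title_id)
    return {
        "digitalserial1": title_id,
        "digitalserial2": code,
        # The last character of the game code selects the region.
        "region": REGION_MAP.get(code[-1], ""),
        # 'H' as the first character marks a system title; 'K' is a normal game.
        "special1": "System" if code[0] == "H" else "",
    }
```

For example, describe_title("000300044B475545") decodes the game code "KGUE" and the region "USA", while a title ID ending in "484E4341" decodes to "HNCA", which would be flagged as a "System" title with the region "World".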

rom

  • forcename: The proper name for the file. This is the filename used to access the file on Nintendo’s CDN, and in the case of decrypted files, the filename of the source file that was decrypted.
  • extension: The file extension we suggest for the file, for file management and ease of use. It also indicates the file's type, which helps with the content ID files whose names are not very descriptive.
    • See #Extension Types for the extensions used for each file type.
  • item: Set to "Main Content" for all main files (metadata/tmd, ticket/cetk, title content), and set to "Miscellaneous Content" for everything else. This is because the dat includes all known hidden content ID forms of files too, which are good for archival and preservation, but not very useful to people who just want to maintain ROM sets.
  • date: The date the file was created. For everything except decrypted content, this date is the date from the Last-Modified header straight from Nintendo’s CDN. This means the field contains the date the file was added to Nintendo’s servers. For decrypted files, this field is set to the date they were decrypted.
  • format: Set to "CDN" for all content downloaded straight from the CDN. For decrypted contents, it is set to "CDNdec". For each decrypted content file, there is a copy of the CDN version of the related metadata and ticket, both with their formats also set to "CDNdec". This is so one can download a purely usable dat of decrypted contents, and their relevant files to go along with them.
  • version: Set to the version the file is for, if such information is relevant. For instance, all metadata and ticket files contain a version, and that is where this value comes from. It is written in "raw,pretty" format: the raw Nintendo version number, followed by its ‘pretty’ breakdown of the form "vX.X.X".
  • size: File size in bytes, calculated at time of dat creation.
  • crc: CRC32 hash of file contents.
  • md5: MD5 hash of file contents.
  • sha1: SHA1 hash of file contents.
  • sha256: SHA256 hash of file contents.
  • serial: For title content files, this is set to the game code, though this game code is sourced from the actual ROM, not the title ID. There should never be a difference, but if there was, it’d be captured this way. This also contains the 12-character internal name of the title - often one or two words, acting as a kind of short title. The game code and internal title are separated by a comma.
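
For anyone wanting to check their own files against the dat, the size and hash fields, and a date taken from a Last-Modified header, could be computed along these lines. This is a Python sketch using only the standard library, not the tool's own code; the function names are invented, and the lowercase hex and ISO date output formats are assumptions that may not match the dat's exact formatting.

```python
import hashlib
import zlib
from email.utils import parsedate_to_datetime


def rom_fields(data):
    """Compute the size and hash fields for one file's contents."""
    return {
        "size": len(data),  # file size in bytes
        "crc": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def date_from_last_modified(header_value):
    """Turn an HTTP Last-Modified header value into a YYYY-MM-DD string."""
    return parsedate_to_datetime(header_value).date().isoformat()
```

Calling rom_fields(b"") gives the values expected for the 0-byte FFFFFFFE file, since every such file is empty and therefore hashes identically.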


Language Determination Method

In the No-Intro dat, each title has a list of supported languages. These languages were automatically determined. This section will detail precisely how they were determined, as well as how to interpret the additional fields related to it.

For each ROM, the tool checks to see if it has an existing 3DS port. A lot of DSiWare titles were later ported over to the 3DS eShop, and the 3DS eShop often contains data on what languages a game supports. If the title has an existing 3DS port and the eShop has language information for it, the tool stops here (as it has a canonical list of supported languages for the title).

If it was unable to find that information, it then moves on to ‘guessing’ the supported languages from the titles (system menu names) contained in the ROM. Nintendo DS games, and by extension, DSiWare, contain Japanese, English, French, German, Italian, Spanish, Chinese, and Korean titles. Some or all of them can be populated, each with the game’s title localized to the language. For languages a game doesn’t support, the title will either be empty or a duplicate of the title from the game’s primary language (typically English, but not always).

The way it figures out what languages a ROM supports from this is as follows - for the sake of example, let's say our ROM has De and Fr titles, and the rest of the titles are duplicates of the En title:

It goes through the list of titles, and any titles that are unique are considered supported languages. In this example, both De and Fr titles will be unique, as they are in their respective languages, while the rest of the titles will be En. This allows the tool to say with relative certainty that De and Fr are supported languages, since they had their own proper localizations of titles.
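
The unique-title step can be sketched like this. It is a simplified Python illustration rather than the tool's actual logic; the function name and the sample titles are invented for the example.

```python
from collections import Counter


def split_titles(titles):
    """Split ROM titles into languages with a unique title and the shared titles.

    `titles` maps a language code (Ja, En, Fr, ...) to the title stored for it;
    empty strings stand for unpopulated titles and are ignored.
    """
    counts = Counter(t for t in titles.values() if t)
    # A language whose title appears exactly once had its own localization.
    unique_langs = [lang for lang, t in titles.items() if t and counts[t] == 1]
    # Titles that appear more than once are copies of some primary title.
    shared_titles = sorted(t for t, n in counts.items() if n > 1)
    return unique_langs, shared_titles


example = {
    "Ja": "Puzzle Game", "En": "Puzzle Game", "Fr": "Jeu de Puzzle",
    "De": "Puzzlespiel", "It": "Puzzle Game", "Es": "Puzzle Game",
    "Zh": "", "Ko": "",
}
unique_langs, shared_titles = split_titles(example)
```

Here unique_langs contains Fr and De, and shared_titles contains the single duplicated title whose language still has to be guessed.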

Since the remaining titles are all copies of the En title, the tool cannot know which is the primary title the others were copied from. This is where NTextCat (https://github.com/ivanakcheurov/ntextcat) comes in. NTextCat is a library designed to guess what language a piece of text is in. The tool sanitizes the title (removing characters and strings that serve no purpose in identifying the language) before passing it to NTextCat, to help prevent false positives and errors. Despite this, the best NTextCat can do is guess - it’s almost always correct for languages with different character sets (Ja, Zh, Ko), but it isn’t always right for Latin-alphabet languages. Part of the problem is that game titles are often only 2-3 words long and/or include made-up words or combinations thereof (e.g. "Easter Eggztravaganza"). These certainly aren’t optimal conditions for NTextCat to do its best.

Because of this, the tool does more work with NTextCat’s guess. The identifier doesn’t just give one answer; it gives all the languages it thinks the title could be in, in order of probability (most likely first). That list is then filtered to remove languages that have already been added, ones that make no sense for the game’s region, etc. The tool then chooses the most likely remaining language, i.e. the highest-ranked NTextCat candidate that is also an expected language for the region. In the case of system titles (those whose title ID does not begin with "00030004"), languages that are not an option for the region (in the DSi settings) are removed as well.
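
The filtering of the ranked guesses might look something like the following Python sketch. It does not call NTextCat (which is a .NET library); the ranked list is simply passed in, and the function name and the example language sets are assumptions made for illustration.

```python
def pick_language(ranked_guesses, already_added, expected_for_region):
    """Return the most likely guess that survives the filters, or None.

    ranked_guesses: language codes ordered from most to least likely.
    already_added: languages already accepted (e.g. from unique titles).
    expected_for_region: languages considered plausible for the release's region.
    """
    for lang in ranked_guesses:
        if lang in already_added:
            continue  # already counted as a supported language
        if lang not in expected_for_region:
            continue  # makes no sense for this region
        return lang
    return None


# Illustrative values only: a European release with Fr and De already found.
guess = pick_language(
    ranked_guesses=["Fr", "Nl", "En", "It"],
    already_added={"Fr", "De"},
    expected_for_region={"En", "Fr", "De", "Es", "It"},
)
```

In this example the result is "En": Fr is skipped because it was already added, Nl because it is not in the expected set, and En is the first candidate left.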

Despite all the logic behind it, at the end of the day it remains an educated guess. For this reason, the No-Intro entry has two additional pieces of information that go alongside the language info. The first is the Language Determination Method, which simply states whether the information was gleaned from the 3DS eShop page of an official 3DS port of the game, or whether it was guessed from the ROM titles as described above. The second, in the case of the ROM title guessing, is the Nebulously-Determined Languages. This tells you the languages that were determined by NTextCat, as they have a much higher chance of being incorrect than ones that existed as unique titles. This is to hopefully make it easier for errors to be found and corrected in the future, after the dat has been added.

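If this explanation wasn't enough or you want to see the code for yourself, it can be found in NUS Ripper's DatOMatic.cs (https://github.com/zedseven/NusRipper/blob/master/NusRipper/Primary/DatOMatic.cs). The relevant section is under the // Languages header.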


Additional Tables

Region Code to No-Intro Region Map

Much of this was sourced from GBATEK (https://problemkaputt.de/gbatek.htm#dscartridgeheader), specifically the last part of its notes on the fourth character of NDS game codes, though with some adjustments.

Code  Region Name
E     USA
J     Japan
P     Europe
U     Australia
K     Korea
V     Europe, Australia
C     China
D     Germany
F     France
I     Italy
S     Spain
O     USA, Europe
X     Europe
T     USA, Australia
H     Netherlands
A     World

Parent Determination Region Order

Keep in mind this order is used only to break ties when sorting by language count is not enough.

Code  Region Name
A     World
V     Europe, Australia
P     Europe
O     USA, Europe
T     USA, Australia
E     USA
U     Australia
X     Europe
F     France
D     Germany
S     Spain
I     Italy
H     Netherlands
J     Japan
K     Korea
C     China

Extension Types

Content Type                                                              Extension
Encrypted content                                                         bin
Decrypted executable ROMs                                                 nds
Decrypted content of some other type (e.g. Nintendo DS Cart Whitelist)    bin
Metadata (including special content ID "form")                            tmd
Ticket (including special content ID "form")                              tik
DSi Shop content (based on file magic)                                    gif, bmp, zip
Unknown empty special content ID, and everything else                     bin
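
Choosing an extension "based on file magic" for DSi Shop content could be done roughly as below. This Python sketch uses the well-known GIF, BMP and ZIP signatures; the function name is invented, and the exact checks the tool performs are not documented here, so treat this as an assumption rather than its real behaviour.

```python
def shop_extension(data):
    """Guess the extension of a DSi Shop file from its leading bytes."""
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"  # GIF signature
    if data.startswith(b"BM"):
        return "bmp"  # Windows bitmap signature (a deliberately loose check)
    if data.startswith(b"PK\x03\x04"):
        return "zip"  # ZIP local file header
    return "bin"      # anything unrecognized, including empty files
```

An empty file, such as the FFFFFFFE content ID, falls through every check and ends up as bin, in line with the last row of the table.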