Knowbuddy's Gnutella FAQ

I don't claim that this FAQ is all-inclusive, just that it contains a written record of some of my thoughts on the subject. No, I haven't even started writing my own client. I'm more of a ... consultant ... at this phase. This FAQ is very much a work in progress, so please let me know if you know of anything that you feel I should address. You can find me on EFnet IRC in #gnutelladev, or email me at gnutelladev@rixsoft.com. The latest version of this document should always be found at <http://www.rixsoft.com/Knowbuddy/gnutellafaq.html>.

Contents

Resources

Gnutella

Online Privacy & Anonymity

A Brief Overview of the Protocol

Terminology

Most of the Internet works on a client-server basis. You, as the client, connect your machine to a server, which is normally bigger and faster than you, and you retrieve information (as files). The server rarely gets any files from you. The gnutella protocol is a bit different in that clients become servers and servers become clients all at once. The playing field is levelled and anyone can be a client or a server, no matter how big or fast they are. Since you can be both, the combination has become known as a "servent". However, to avoid confusion, I'll try to stick with the standard definitions of client and server whenever possible, to create a context.

This is accomplished by creating a sort of distributed environment. You act as a server to people who want the files on your machine, and you act as a client to access files on other people's machines. Of course, you can be just a server (by never bothering to retrieve any other files) or just a client (by not sharing any of your files), but in the spirit of openness and cooperation, you will probably end up doing a little of both. The gnutella network is made up of hundreds (eventually many thousands) of servents all chattering away at each other and sending files back and forth.

All communication is done over TCP/IP, using TCP connections. Each piece of information is called a "packet", just like in Internet terms. More often than not, the gnutella packets coincide nicely with TCP/IP packets. Right now, the protocol uses TCP only; no UDP.

Connecting

To connect to the network, you only have to know one thing: the IP address and port of any servent that is already connected. The first thing your servent does when it connects is announce your presence. The servent you are connected to passes this message on to all of the servents it is already connected to, and so on until the message propagates throughout the entire network.[1] Each of these servents then responds to this message with a bit of information about itself: how many files it is sharing, how many KBs of space they take up, etc. So, in connecting, you immediately know how much is available on the network to search through.

Searching

Searching works similarly to connecting: you send out a search request, it is propagated through the network, and each servent that has matching terms passes back its result set. Each servent handles the search query in its own way. The simple query ".mp3" could be handled in different ways by different servents: one servent might take it literally, matching anything with ".mp3" in it, while another might interpret it as a regular expression and match any character followed by "mp3". To save on bandwidth, a servent does not have to respond to a query if it has no matching items. The servent also has the option of returning only a limited result set, so that it doesn't have to return 5000 matches when someone searches for "mp3".
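
To make the ".mp3" example concrete, here is a minimal Python sketch of two servents answering the same query in different ways. The function names, the `MAX_RESULTS` cap, and the file list are all made up for illustration; nothing here is dictated by the protocol.

```python
import re

MAX_RESULTS = 3  # hypothetical cap; each servent would pick its own limit

SHARED_FILES = ["song.mp3", "notes_mp3.txt", "xmp3", "readme.txt", "live.MP3"]

def match_literal(query, files):
    """Treat the query as a plain, case-insensitive substring."""
    q = query.lower()
    return [f for f in files if q in f.lower()][:MAX_RESULTS]

def match_regex(query, files):
    """Treat the query as a regular expression, so '.' matches any character."""
    pattern = re.compile(query, re.IGNORECASE)
    return [f for f in files if pattern.search(f)][:MAX_RESULTS]

literal_hits = match_literal(".mp3", SHARED_FILES)  # only names containing ".mp3"
regex_hits = match_regex(".mp3", SHARED_FILES)      # any character followed by "mp3"
```

The literal servent returns two files, while the regex servent matches four and trims its reply to the first three. A servent with no matches would simply send nothing back.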

Since every search is run against each servent's local database, every servent sees what everyone else is searching for. Most clients take advantage of this with a Search Monitor that lets the user watch, in real time, the searches their servent is responding to.

Downloading

For file sharing, each servent acts as a miniature HTTP web server. Since the HTTP protocol is well established, existing code libraries can be used. When you find a search result that you want to download, you just connect to the servent in the same way your web browser would connect to a web server, and you are good to go. Of course, the servent has this built-in, so your normal web browser never has to enter the picture.

Servents are also smart enough to compensate for firewalls. If you are behind a firewall that can only connect to the outside world on certain ports (80, for instance) you will just need to find a servent running on port 80. Since the servents can serve on any port, you are likely going to find one that is serving on a firewall-friendly port. Also, if you are trying to download a file from a servent that is behind a firewall, you can ask the firewalled servent to push the file to you since you will not be able to connect to it directly. The only thing the protocol cannot compensate for is file transfers between two servents behind two different firewalls. In such a case, there really isn't anything that can be done.
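
The firewall cases above boil down to a small decision table. The sketch below is only an illustration: the function name and the return strings are hypothetical, and "firewalled" is taken to mean "cannot accept incoming connections".

```python
def transfer_strategy(downloader_firewalled, host_firewalled):
    """Pick how a file can move from the hosting servent to the downloader."""
    if not host_firewalled:
        return "direct"      # downloader connects straight to the host
    if not downloader_firewalled:
        return "push"        # host connects outward and pushes the file
    return "impossible"      # neither side can accept a connection
```

For example, `transfer_strategy(False, True)` gives `"push"`, which is the case where you ask a firewalled servent to push the file to you.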

TTL (Time To Live)

Just like TCP/IP packets, gnutella packets have a TTL (Time To Live). The TTL starts off at some low number, like 5. Each time a packet is routed through a servent, the servent lowers the TTL by 1. Once the TTL hits 0 the packet is no longer forwarded. This helps to keep packets from circling the network forever. Also, each servent has the option to arbitrarily lower the TTL of a packet if it thinks it is unreasonable. So, even if I send all my packets with a TTL of 200, the odds are that most of the servents along the way are going to just immediately knock this down to a more reasonable number. The number of servents the packet has already been routed through is also noted, and acts as a sort of reverse-TTL.
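
A minimal sketch of what a servent might do with the TTL when routing a packet. The `MAX_TTL` ceiling and the function name are assumptions made for the example; the protocol leaves the "reasonable" value up to each servent.

```python
MAX_TTL = 7  # hypothetical local ceiling for an "unreasonable" TTL

def forward_ttl(remaining_ttl, hops_taken, max_ttl=MAX_TTL):
    """Return (ttl, hops) for the forwarded packet, or None if it should be dropped."""
    ttl = min(remaining_ttl, max_ttl)  # knock an oversized TTL down first
    ttl -= 1                           # passing through this servent costs one
    hops = hops_taken + 1              # the reverse-TTL counts upward
    if ttl <= 0:
        return None                    # expired: do not forward
    return ttl, hops
```

With these numbers, a packet arriving with a TTL of 200 leaves with a TTL of 6, and a packet arriving with a TTL of 1 is not forwarded at all.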

Differences from Napster

A Napster network is closer to the traditional client-server motif. A client connects to one prebuilt Napster server and no one else. All queries are routed through this central server and it is the server that does the searching and returns the result set. The server still does not host the files, though. Once you have picked out a file you want, file transfer works similarly to the gnutella method.

The up-side of this is that there are many redundant servers and they are all in fixed locations, so you will always be able to connect to a Napster server somewhere. Also, the dedicated hardware for the searches is normally pretty fast and optimized, and you get your results all at once. The search language is also controlled, so you know that each server is going to treat search terms similarly. (With badly-written gnutella clones you have the possibility of "*.mp3" not actually matching anything because it doesn't support globbing.)

The down side is that the central servers don't talk to each other. This means that each Napster network is a wholly separate entity, which severely limits your search options. Several of the Napster clones allow you to choose which network you want to join, but the average user still has a hit-or-miss chance of finding the same user on the same network twice. Also, if the central server is bogged down, searches can take inordinate amounts of time.

Similarities with Napster

Both Napster and Gnutella allow you to control what files you share. Gnutella takes this a bit further than Napster by also allowing you to share different types of files, but the basic principles are the same.

Servent-to-Servent Communication

Figure 001

For the following examples, I am going to use the old cryptographic examples of Alice, Bob, and Eve. I'll also throw in Charlie, who will always be between Alice and Bob, and is essentially Alice's link to the outside world. Eve is the bad hax0r script-kiddie, bent on bringing down the network. Let us say that Alice is behind a firewall, connected only to Charlie, who is then also connected to Bob. Eve is going to move around a bit, so we'll leave her floating in limbo. Refer to Figure 001.

The Packet Header

All of the packets traveling around the network have a 23-byte header that consists of the following information:

  1. MessageID, 16 bytes - A unique identifier used for tracking this specific packet. It should be unique to the network, which is to say that two servents may not generate the same MessageID (within a reasonable amount of time), and one servent should never use the same MessageID more than once.
  2. FunctionID, 1 byte - The underlying point of the message. Labels the packet as a search, or a connection announcement (initialization), etc.
  3. RemainingTTL, 1 byte - The TTL left to this packet. The originating servent sets this and each servent the packet is routed through decrements it. See TTL.
  4. HopsTaken, 1 byte - The number of servents this packet has already been routed through. This starts at 0 and is incremented by each servent.
  5. DataLength, 4 bytes - The size of the remaining data in the packet. Included so that the processing servent will know when the incoming packet ends.
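
The five fields above can be packed and unpacked with a few lines of Python. This is a sketch under assumptions: the field list does not say which byte order DataLength uses, so little-endian is assumed here, and using a random UUID for the MessageID is just one plausible way to get 16 bytes that are unlikely to collide. The helper names are made up.

```python
import struct
import uuid

# 16-byte MessageID, three 1-byte fields, 4-byte DataLength = 23 bytes.
# "<" means little-endian with no padding (the byte order is an assumption).
HEADER = struct.Struct("<16sBBBI")

def new_message_id():
    """Return 16 random bytes to use as a MessageID."""
    return uuid.uuid4().bytes

def pack_header(message_id, function_id, remaining_ttl, hops_taken, data_length):
    return HEADER.pack(message_id, function_id, remaining_ttl, hops_taken, data_length)

def unpack_header(raw):
    """Decode the first 23 bytes of a packet into the five header fields."""
    return HEADER.unpack(raw[:HEADER.size])

message_id = new_message_id()
raw_header = pack_header(message_id, 0x80, 5, 0, 12)  # a Search with 12 bytes of data
```

Reading DataLength first tells the receiving servent how many more bytes to pull off the connection before the next header starts.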

Incidentally, since each connection is a unique combination of host and port, we'll track these later on using the pseudo-key ConnectionID.

Functions

The FunctionID field of the packet header tells the servent how to process the request. Valid requests are the INITialization (0x00), Search (0x80), and Client-Push Request (0x40). INITialization and Search both have responses, which set the low bit, making their values 0x01 and 0x81, respectively. A response to a Client-Push Request isn't necessary, as the receiving servent would either then push the file or not push the file to the sender. For more information on the layout of the packets for the individual functions, see the excellent documentation at <http://capnbry.dyndns.org/gnutella/protocol.html>.
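
The relationship between requests and responses can be written down directly. The constant and function names below are my own; only the numeric values come from the description above.

```python
INIT = 0x00
PUSH_REQUEST = 0x40
SEARCH = 0x80
RESPONSE_BIT = 0x01  # the low bit marks a response

def response_id(function_id):
    """Return the FunctionID of the matching response, or None if there is none."""
    if function_id in (INIT, SEARCH):
        return function_id | RESPONSE_BIT
    return None  # a Client-Push Request has no response packet

def is_response(function_id):
    return function_id in (INIT | RESPONSE_BIT, SEARCH | RESPONSE_BIT)

def request_id(function_id):
    """Clear the low bit to recover the request a response belongs to."""
    return function_id & ~RESPONSE_BIT
```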

Packet Routing

We'll start off with an INITialization packet, as that is how you announce your presence, and searches work in essentially the same way. When Alice connects to Charlie, she sends her INIT packet. Charlie receives it, and routes it on to Bob and Eve. At the same time, he sends back to Alice an INIT Response that tells Alice what his host IP and port are (even though Alice already knows this), how many files he is sharing, and how much space those files take up. When Bob and Eve get the routed INIT packet from Charlie, they send their own INIT Responses back to Charlie, who then forwards them back to Alice. Incidentally, Bob and Eve both forward the INIT packet on to everyone they are connected to, and so on, until the packet expires. In this way, Alice now has information about everyone her packets can reach before they expire.

The trick here is that Charlie needs to keep track of some of the messages that come his way. For this, he needs a good Routing Table. A routing table is a list of the last few hundred packets you have received, who sent them, and what they did. In this case, Charlie needs to keep track of the MessageID for the INIT packet that Alice sent, so when he gets replies from Bob and Eve he will know that they are supposed to go to Alice, and not some other person he is connected to. A really good routing table is indexed by MessageID, FunctionID, and ConnectionID, for fast lookup and just in case different clients use the same MessageID.

This is also useful because Charlie is eventually going to get the original INIT packet back from Eve, because Eve has no way of knowing that Charlie has already seen it. This isn't a problem as Charlie just looks in his Routing Table, notes that he has already processed this request, and simply drops the packet. The bigger the Routing Table, the less chance there is for propagating duplicate packets.
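
Here is a toy, in-memory simulation of the routing just described. It is only a sketch: the `Servent` class and its methods are hypothetical, direct method calls stand in for TCP connections, and the routing table is a plain dictionary keyed by MessageID alone. Bob and Eve are also connected to each other here, so that the INIT packet loops back to Charlie.

```python
class Servent:
    """Toy servent; method calls stand in for network connections."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []       # servents we are directly connected to
        self.routing_table = {}   # MessageID -> neighbor it arrived from (None = ours)
        self.responses = []       # INIT Responses to our own INIT packets
        self.dropped = 0          # duplicate packets discarded

    def connect(self, other):
        self.neighbors.append(other)
        other.neighbors.append(self)

    def send_init(self, message_id, ttl=5):
        self.routing_table[message_id] = None
        for neighbor in self.neighbors:
            neighbor.receive_init(message_id, ttl, self)

    def receive_init(self, message_id, ttl, sender):
        if message_id in self.routing_table:
            self.dropped += 1     # already processed: drop the packet
            return
        self.routing_table[message_id] = sender
        sender.receive_response(message_id, self.name)  # our own INIT Response
        if ttl - 1 > 0:
            for neighbor in self.neighbors:
                if neighbor is not sender:
                    neighbor.receive_init(message_id, ttl - 1, self)

    def receive_response(self, message_id, responder):
        if message_id not in self.routing_table:
            return                # no record of the request: nowhere to route it
        origin = self.routing_table[message_id]
        if origin is None:
            self.responses.append(responder)  # it answers our own INIT
        else:
            origin.receive_response(message_id, responder)


alice, bob, charlie, eve = (Servent(n) for n in ("Alice", "Bob", "Charlie", "Eve"))
alice.connect(charlie)
charlie.connect(bob)
charlie.connect(eve)
bob.connect(eve)  # extra link so the INIT comes back around to Charlie
alice.send_init("init-1")
```

After the last line, Alice has heard from Charlie, Bob, and Eve, and Charlie has dropped one duplicate. Because the calls are synchronous, Eve happens to see the packet from Bob first, so her response travels back through Bob and then Charlie; on a real network the order depends on timing.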

Also, Charlie doesn't want to keep all of the packets he has seen, just enough. So, Charlie will most likely also have a Most Recently Processed rotating pool. Basically, Charlie keeps track of the last time he got a duplicate for a fixed number of packets, 500 for example. Whenever he gets a new packet, he takes the oldest one from his MRP pool, deletes the corresponding entry from his Routing Table, and replaces both of them with the new packet. In this way, his Routing Table stays a fixed size so it doesn't eat up memory, but he's also limiting the chance of propagating duplicate packets.
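
One way to get this behaviour is to fold the Routing Table and the MRP pool into a single ordered dictionary. This is a sketch of the idea, not how any particular client does it; the class and method names are invented.

```python
from collections import OrderedDict

class BoundedRoutingTable:
    """Fixed-size routing table that evicts the least recently seen packet."""

    def __init__(self, capacity=500):
        self.capacity = capacity
        self.entries = OrderedDict()  # MessageID -> ConnectionID, oldest first

    def seen(self, message_id):
        """Return True for a duplicate, and mark it as recently seen."""
        if message_id in self.entries:
            self.entries.move_to_end(message_id)
            return True
        return False

    def add(self, message_id, connection_id):
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # drop the oldest entry
        self.entries[message_id] = connection_id

    def route_for(self, message_id):
        """Return the connection a response should go back to, or None."""
        return self.entries.get(message_id)
```

A packet that keeps showing up as a duplicate keeps getting moved to the fresh end of the pool, so the entries most likely to be needed again are the last ones to be thrown away.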

Searching works on essentially the same concept: Alice to Charlie to Bob and Eve and then back again. However, obviously some additional information is passed back, such as the connection information of the hosting servent, and an array of results in a result set.

An Important Note on Anonymity and Tracking

There is one thing to note which will come into play much more later as we discuss security and spoofing: Bob cannot reliably tell whether the packets are originating from Alice or Charlie. The HopsTaken field of the header should let him know if it was Charlie (as it would be 0), but beyond that he cannot be sure, as he cannot know who else is connected to Charlie. There is also the possibility that Charlie is sending incorrect information in that field, so it really cannot be used to trace a packet back to its owner. Due to this, each servent only reliably knows about the servents it is directly connected to. Anyone else is a mystery. This is not a bug, it is a feature. In this way, Eve cannot correlate searches with any specific user or prosecute them for doing things that she considers wrong.

Known Issues with the Protocol

Search Query Spoofing

Eve's current favorite trick is to flood the network with so many search requests as to make it unusable by slower users. Since a search packet cannot be traced back to a specific sender, there currently is no reliable method of blocking such an attack. One suggestion was to disconnect hosts that suddenly start forwarding on large numbers of search requests. This has the possibility of simply disconnecting fast users, but it's the only viable solution at this time. Implementation of this is going to be tricky, though, as "too fast" is going to be different for every client.
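
A sliding-window counter is one simple way to decide when a connection is sending searches "too fast". The sketch below is purely illustrative: the class name and the default limits are made up, and picking sensible limits is exactly the tricky part mentioned above.

```python
from collections import deque

class SearchRateMonitor:
    """Count searches per connection inside a sliding time window."""

    def __init__(self, max_searches=50, window_seconds=10.0):
        self.max_searches = max_searches
        self.window = window_seconds
        self.times = {}  # ConnectionID -> deque of search timestamps

    def record(self, connection_id, now):
        """Record one search; return True if the connection is over the limit."""
        stamps = self.times.setdefault(connection_id, deque())
        stamps.append(now)
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()  # forget searches that fell out of the window
        return len(stamps) > self.max_searches
```

The timestamp is passed in rather than read from the clock so the behaviour is easy to test; a real servent would pass something like `time.monotonic()`.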

Another idea is to allow the user to tell the servent which search results are bogus. Once the servent has collected enough verifiably bogus packets from one other servent, it can disconnect from that servent. If Charlie tells his servent that enough bogus packets are coming from Eve's direction, then the servent can just assume Eve to be hostile and refuse to connect to her. Packets may still reach Charlie via other routes (through Bob), but if enough servents deny a host, eventually that host will be unable to connect to anyone. To prevent Eve from simply switching to another port, a "Ban this IP" option would probably be the way to go.
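
The user-driven banning idea might look something like this. Everything here is hypothetical (the class, the threshold of three reports, the example addresses); the point is only that the ban is keyed on the IP address, so switching ports does not help Eve.

```python
class BanList:
    """Ban an IP once the user has flagged enough bogus results from it."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.bogus_counts = {}   # IP address -> number of bogus reports
        self.banned_ips = set()

    def report_bogus(self, ip):
        """Record one user report; return True if the IP is now banned."""
        count = self.bogus_counts.get(ip, 0) + 1
        self.bogus_counts[ip] = count
        if count >= self.threshold:
            self.banned_ips.add(ip)
        return ip in self.banned_ips

    def allow_connection(self, ip, port):
        # The port is deliberately ignored: a banned host stays banned on any port.
        return ip not in self.banned_ips
```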

Search Result Spoofing

This is a bit harder to deal with, as an intermediary servent (Charlie) has no way of knowing that the result packets it is routing (from Eve) contain bogus data. Only when Alice connects to Eve to retrieve the file will she know that the result was spoofed. Also, there is the possibility that Eve may be returning valid file pointers, but the files aren't what she says they are, and may contain viruses, etc. Again, an adaptive system on Alice's end such that Alice's servent eventually just refuses to see anything from Eve may be the answer to this.

Protocol Enhancement Ideas

These are just some ideas that I have come up with or that have been mentioned on the mailing list or IRC. They are not, by any means, in any form of implementation; they are just ideas.

Authentication and Trust

If we extend the adaptive banning system that we are using to combat search spoofing such that we allow Charlie to tell Bob that he is banning Eve, then we begin to form a trust network. If Bob trusts Charlie, he may opt to put Eve "on probation" and watch her more closely. Or, he may simply trust Charlie implicitly and immediately ban Eve, or not trust Charlie at all and simply ignore him. Currently the protocol does not support sharing trust or banning information, but it could be worked in. Even if we don't want to extend the protocol to add a specific function for this, we could do it with specialized search packets that Bob and Charlie know not to route.
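
Bob's three possible reactions could be driven by a single trust score. This is a sketch of an idea that is itself only an idea: the 0.0 to 1.0 scale, the two cut-off values, and the names are all assumptions.

```python
BAN, PROBATION, IGNORE = "ban", "probation", "ignore"

def react_to_ban_notice(trust_in_reporter, full_trust=0.9, some_trust=0.3):
    """Map Bob's trust in Charlie (0.0 to 1.0) to an action against Eve."""
    if trust_in_reporter >= full_trust:
        return BAN         # trust the reporter implicitly
    if trust_in_reporter >= some_trust:
        return PROBATION   # watch the reported host more closely
    return IGNORE          # the reporter is not trusted at all
```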

Additionally, we may choose to integrate some sort of authentication protocol such that Charlie knows that Alice is indeed Alice. One person in the channel suggested PGP-style keys and another mentioned Kerberos.

Using UDP

One of the most-often asked questions is why TCP is used instead of UDP. One problem with using UDP is its connectionless nature. Charlie knows that Alice and Bob are there because their connections are still up. If Charlie used UDP, he'd have a tougher time telling when Alice or Bob disconnected abruptly from the network, and would probably waste bandwidth sending at Alice and Bob when they aren't actually there. Connection build time is the argument most heard from UDP proponents, but this isn't really an issue. Since the servents stay in constant contact with each other and aren't just dropping and creating connections on the fly (normally), the overhead for connection building is minimal.

A Gnutella Proxy Server

This was first brought up on the mailing list, and then in the channel. The following is a combination of ideas by myself, Watts, and Luis Muniz.

One of the things we've been tossing around on both the list and the channel is the idea of a gnutella proxy server or gateway. We've figured out a way to do it without having to break the protocol, so theoretically, someone could implement one immediately. A couple different design goals motivated us:

We'll continue with the above diagram in which Alice is the private local network user, Charlie is the proxy servent, and Bob and Eve are users on the rest of the public network. Since we do not have to break the protocol to implement such a beast, there are varying levels of proxying that can be brought about.

First, when Alice connects to Charlie, Charlie immediately does a *.*-type search to get Alice's shared files list. From then on, Charlie treats Alice's shared files as his own. When Bob performs a search that matches one of Alice's files, Charlie spoofs a return packet that points to the file on Alice's machine. The search request never even reaches Alice, but Alice's files are searched. Then, when Alice does a search that matches a file of Bob's, Charlie has been keeping track of Alice's searches and allows that response to pass back to Alice. In this way, Alice doesn't see any of the other network traffic, just the packets pertinent to her. Of course, if Alice doesn't want to share her files with anyone on the outside, she can simply not return any files to Charlie.
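
A rough sketch of the bookkeeping Charlie would need for this first level of proxying. The class, its methods, and the example addresses are hypothetical, and substring matching stands in for whatever search logic the proxy really uses.

```python
class ProxyServent:
    """Answers outside searches for its clients and filters what they see."""

    def __init__(self):
        self.client_files = {}      # client address -> list of shared file names
        self.pending_searches = {}  # MessageID -> client address that asked

    def register_client(self, client_address, shared_files):
        """Store the list obtained by the *.*-type search on connect."""
        self.client_files[client_address] = list(shared_files)

    def answer_outside_search(self, query):
        """Return (host, file name) pairs pointing at the clients' machines."""
        q = query.lower()
        return [(address, name)
                for address, files in self.client_files.items()
                for name in files
                if q in name.lower()]

    def note_client_search(self, message_id, client_address):
        self.pending_searches[message_id] = client_address

    def route_response(self, message_id):
        """Return the client a response belongs to, or None to drop it."""
        return self.pending_searches.get(message_id)


charlie = ProxyServent()
charlie.register_client("10.0.0.5:6346", ["Thesis.doc", "holiday.jpg"])
hits = charlie.answer_outside_search("thesis")
```

If Alice returns an empty list to the initial search, `answer_outside_search` never finds anything of hers, which is the "share nothing with the outside" case.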

If we assume that Alice only wants to connect to proxying servents like Charlie, and that Charlie is working together with these other proxy servents, then we have a way for Alice to stay behind a proxy. Upon receiving Alice's INIT packet, Charlie does not pass it on to the rest of the network, but spoofs return packets for the other proxy servers he is working with. In this way, Alice's host catcher is only populated with other proxy servers and the outside network does not even have to know that Alice exists.

If we want to add another level of complexity, such that Alice's files are not directly pointed to by Charlie's spoofed search return packets, then we can have Charlie act as a gateway for Alice and Bob. Bob does a search that matches one of Alice's files. Charlie spoofs the search response to look like he is holding the file. When Bob requests the file, Charlie simultaneously requests the file from Alice and simply shovels the data in one port and out the other. Requests from Alice to Bob work in the same manner. Alice and Bob never even have to know about each other.
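
"Shovelling the data in one port and out the other" is little more than a copy loop. The sketch below uses in-memory streams in place of the two sockets so it can be run as-is; the function name and chunk size are arbitrary.

```python
import io

def relay(source, sink, chunk_size=4096):
    """Copy bytes from source to sink in chunks; return the number of bytes moved."""
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break          # source is exhausted
        sink.write(chunk)
        total += len(chunk)
    return total

from_alice = io.BytesIO(b"x" * 10000)  # stands in for the download from Alice
to_bob = io.BytesIO()                  # stands in for the upload to Bob
moved = relay(from_alice, to_bob)
```

Because the gateway only ever holds one chunk at a time, Charlie does not need to store Alice's file in order to pass it along to Bob.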

This also allows Alice, who can only connect to Charlie because of her firewall, to get files from Bob. Of course, this doesn't stop Eve from spoofing search results or substituting trojans, but it does keep Eve's knowledge of Alice very limited. Also, what you then essentially have is a Napster network in which the servers/servents talk to each other.

Of course, if Alice is able to connect to other non-proxy servents outside of her private network then she certainly could still do so. However, it is in her best interests, for bandwidth reasons and anonymity, to only connect to one proxy server at a time and no one else.

Notes

  1. Actually, this isn't entirely true. Every piece of information that you send has a TTL (Time To Live) that acts as an expiration. But, for our purposes, we'll assume that every servent sees every piece of information that goes through the network.