Directories and Folders
The path component of the URL generally reflects the actual file structure on the server. Nothing requires this, but it is usually the case. The slash character, /, divides the path into sections. The first section identifies a folder or directory within the document root. The second section identifies a folder or directory within the first, and so on. The last section identifies a specific file within the directory specified by the preceding sections.
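As a small illustration of the splitting described above, here is a sketch in Python. The host example.com and the function name path_sections are made up for the example; the path itself is the one used later in this text.

```python
from urllib.parse import urlsplit

def path_sections(url):
    """Return the slash-separated sections of the URL's path, in order."""
    path = urlsplit(url).path
    # Drop empty strings produced by the leading slash or doubled slashes.
    return [section for section in path.split("/") if section]

# example.com is a placeholder host, not a real site from this text.
sections = path_sections("http://example.com/docs/2013/01/file.html")
print(sections)       # ['docs', '2013', '01', 'file.html']
print(sections[-1])   # the last section names the file: file.html
```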
The Document Root: public_html
The user never knows exactly what is on the server, or where it is stored. All the user really knows is (a) I presented a particular URL, and (b) the server sent back this content. In Apache, the content lives inside a tree-structured file system, with one part of it designated as the document root. In the example path docs/2013/01/file.html, the docs folder is located in the document root. On many shared hosting setups, the document root is a folder named public_html in the home directory of the website owner. Apache lets you change that, but few people do.
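To make the idea concrete, here is a rough sketch of how a URL path could be mapped onto a file beneath the document root. This is not how Apache does it internally. The home directory /home/alice and the function name resolve are hypothetical, and the sketch does nothing to guard against paths containing "..".

```python
from pathlib import PurePosixPath

def resolve(document_root, url_path):
    """Join a URL path onto the document root to get a server file path."""
    root = PurePosixPath(document_root)
    # Strip the leading slash so the URL path is treated as relative to root.
    return root.joinpath(url_path.lstrip("/"))

# /home/alice is an assumed home directory, used only for illustration.
print(resolve("/home/alice/public_html", "/docs/2013/01/file.html"))
```

The user only ever sees the URL path. The part of the file path that comes from the document root stays hidden on the server.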
index.html, index.htm, index.cgi
If the final component of the path is a directory, and not a file, the Apache webserver will automatically produce an index, on the fly, unless you tell it not to. This can be handy. You can have a directory full of files, and when you visit that directory you can see a list of all those files. Normally they are linked for easy access. There is a hugely important exception. If one of the files is named index, as in index.html or index.htm or index.cgi, Apache will assume that file will do the work of showing the index. Apache will simply show that file instead of creating an index.
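The decision described above can be sketched as a small Python function. This is only a model of the behavior, not Apache's code. The function name and the order in which the index names are tried are assumptions; in Apache the list of names is controlled by the DirectoryIndex directive.

```python
# Assumed search order; a real server takes this list from its configuration.
INDEX_NAMES = ("index.html", "index.htm", "index.cgi")

def respond_to_directory(filenames, index_names=INDEX_NAMES):
    """Decide what to show when the requested path is a directory."""
    for name in index_names:
        if name in filenames:
            # An index file exists, so it is shown instead of a listing.
            return ("serve", name)
    # No index file: fall back to an automatically generated listing.
    return ("listing", sorted(filenames))

print(respond_to_directory(["notes.html", "photo.jpg"]))
print(respond_to_directory(["notes.html", "index.htm"]))
```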
Many users know about the automatic index generation. When they see a path, they may change the URL by deleting the last part of the path. Then they submit it and see what comes back. This is called directory browsing. This method can frequently be used to gain access to webpages that have not yet been linked, and are perhaps not yet meant to be visible.
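Deleting the last part of the path, over and over, gives a list of directory URLs a curious user might try. The sketch below shows that trimming in Python. The host example.com and the function name parent_urls are made up for the example, and nothing here actually contacts a server.

```python
from urllib.parse import urlsplit, urlunsplit

def parent_urls(url):
    """List the directory URLs obtained by repeatedly dropping the last section."""
    parts = urlsplit(url)
    sections = [s for s in parts.path.split("/") if s]
    parents = []
    while sections:
        sections.pop()  # delete the last part of the path
        path = "/" + "/".join(sections)
        if sections:
            path += "/"  # a trailing slash marks the path as a directory
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents

for candidate in parent_urls("http://example.com/docs/2013/01/file.html"):
    print(candidate)
```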
If you care about this for your own website, you should provide an index.html file in every directory that users might discover. You can also put Options -Indexes in your .htaccess file to explicitly turn off index generation.
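One way to check your own site is to walk a local copy of the document root and list the directories that have no index file. The sketch below does that in Python. The function name is hypothetical, and the set of index names is an assumption that should match whatever your server is configured to use.

```python
import os

def directories_missing_index(root, index_names=("index.html", "index.htm", "index.cgi")):
    """Return every directory under root (including root) that lacks an index file."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        if not any(name in filenames for name in index_names):
            missing.append(dirpath)
    return sorted(missing)

# Example use, assuming public_html exists in the current directory:
# for directory in directories_missing_index("public_html"):
#     print(directory)
```

Each directory it reports would get an automatically generated listing unless index generation is turned off.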
robots.txt and Search Engines
Search engines like Google, Yahoo, and Bing traverse the web using web crawlers. Web crawlers are programs that pretend to be browsers, but in reality they look at each page they can find and store it in an index for later use. Search engines can use any webpage they find, including ones that you might not want them to notice. Mainly they should index content that is stable for the long term. Things that change daily are not really good for indexing.
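Well-behaved crawlers conventionally read a site's robots.txt file before fetching pages, and skip the paths it disallows. The sketch below uses Python's standard urllib.robotparser module to show how a crawler might make that check. The rules, the /drafts/ folder, the crawler name ExampleBot, and the host example.com are all invented for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt asking all crawlers to stay out of /drafts/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def crawler_may_fetch(url, user_agent="ExampleBot"):
    """Report whether the parsed rules allow this crawler to fetch the URL."""
    return parser.can_fetch(user_agent, url)

print(crawler_may_fetch("http://example.com/docs/2013/01/file.html"))  # True
print(crawler_may_fetch("http://example.com/drafts/new.html"))         # False
```

Keep in mind that robots.txt is only a request. It does not stop a person or a badly behaved program from visiting those pages.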