Project 5 Golang Concurrent Cached File Server

Deadline: Friday, May 3rd at 11:59pm

NOTE

This document is a specification of what we want from your project. We know you may run into bugs or questions as you tackle it, so here are a few good resources you can use to solve the issues you come across: Google, StackOverflow, and the Golang docs. And yes, Google is a great source for answering questions! At the very least, it can lead you to another location that may help. We want you to try your best to use these resources instead of relying on TAs to answer all of your questions, since you may not always have someone to ask (partly because you may be working on things that are new to the industry). To encourage this, TAs may not respond to posts within 12 hours, because we want you to search for the answers yourself first. However, we understand that not every problem is easy to solve, so we will answer your questions to the best of our ability after that initial delay. Good luck, you can do it!

Objectives

  • TSW learn how to make a basic file server.
  • TSWBAT make and use a resource which does not need locks for concurrent accesses.
  • TSW learn what to do when a request may time out in a system.

Getting Started

Completing Lab 12 is highly recommended since it contains the setup information for GoLang and also a quick tutorial for how to program in GoLang.

Please accept the GitHub Classroom assignment by clicking this link. Once you’ve created your repository, follow the instructions below.

Setup

Run this command to get the userlib which you will be using in this project:

 go get github.com/61c-teach/sp19-proj5-userlib

Next, let’s run the commands to get your personal repo AFTER YOU HAVE ACCEPTED THE GITHUB CLASSROOM LINK:

mkdir -p $GOPATH/src/github.com/61c-student
cd $GOPATH/src/github.com/61c-student
git clone https://github.com/61c-student/sp19-proj5-YOUR_GITHUB_USERNAME.git

Don’t forget to replace YOUR_GITHUB_USERNAME with the GitHub username you used to create the repo.

Also if you do not have $GOPATH defined, please take a look at the Go section to see how you should define it.

You’ll also need to add the starter code remote with the following command:

git remote add starter https://github.com/61c-teach/sp19-proj5-starter.git

If we publish changes to the starter code, retrieve them using git pull starter master.

Any updates and important information will be recorded on Piazza, so check there to see if there are any changes for you to pull.

Background

File servers are used all over the internet to host data which you may be searching for. A file server uses some protocol to connect to other computers on a network. When it gets a request from another computer on the network, it processes the request (typically a path to a file in the form of a URL) and returns the data of that file, which is stored locally on the server. The protocol gives the different computers on the network a well-defined way to communicate. The protocol that used to be used everywhere for this is http; lately, many websites have moved to https because of the many security issues http has. You do not need to understand how these protocols actually work, though if you want to learn more about them, you can take CS 161 Computer Security and CS 168 Introduction to the Internet: Architecture and Protocols.

The problem for many of these sites is that disk reads are really slow so when many users are trying to access a file, it can take a long time for the request to get a response. To combat this, many file servers implement caching so that subsequent reads to the same file can be faster.

One problem a server may face is a lot of disk reads at once. This can make the response to a request take a long time. The disk may even never respond at all, though this is highly unlikely. Because of this, servers implement a timeout system to ensure that every request is answered.

Another problem which web servers face is security vulnerabilities! Often, there are attacks (like buffer overflows) which can give an adversary access to the host machine, but there are also some simpler attacks which servers must defend against. You can learn more about that in CS161. The simplest attack on a file server is a directory traversal attack. For example, say I stored a text file containing the passwords to log into the computer at ~/passwords.txt and hosted my file server at ~/public_html/. If the server did not sanitize the input of a request, an attacker could use ../ to go to the parent directory and then look for the passwords file. So if the attacker requested the file /../passwords.txt, they would see the list of passwords, since the server would look in the parent directory of the web server. This issue can be hard to eliminate completely, but there are some methods you can use to mitigate it.

If you would like to learn more about security flaws and how you can prevent them, you should take CS161! It is a great course to take after CS61C, and it is very interesting to see the real-world security issues in computing.

Go

You will be using the Go programming language for this project. Go is a good choice for this sort of thing for a number of reasons. Firstly, Go was designed from the ground up to support concurrency. Second, Go has very robust and easy-to-use testing tools. You should expect to write at least as many lines of unit tests as you do application code. Finally, Go is a fairly new language that tries to find a happy medium of performance (through compilation) and safety (through strict typing and garbage collection). Go is very similar to C in many ways so it should look fairly familiar to you (like learning Italian if you already know Spanish). However, it is also not so similar to C in many ways so your first task in this project will be to learn Go! Fortunately, there are excellent tools for this. I’ll list a few here:

  1) Basic Installation Instructions: Go is available on the Hive machines, but you’ll probably want to get it working locally to ease development.
  2) Overview of Go tools and other practical things: Go is very particular about how files are laid out on your computer and about a few environment variables, so be sure to read this whole document before trying to write your own Go code! It is particularly important to have the environment variable $GOPATH set correctly, since all the Go tools (and these instructions) rely on it. This will be set automatically on the Hive machines, but you’ll need to set it up on your local machine.
  3) Go Language Tutorial: While you’re encouraged to go through this entire tutorial, you probably won’t need all of it. You should focus on the first two sections of “Basics” (“Packages, variables, and functions” and “Flow control statements”) and on “Concurrency” (the main focus of this project).

Hint: The word “Go” isn’t easily googled, so you should use the term “golang” in any searches.

Your Task

In this project, you will be implementing a simple file server with file caching, where the cache has a limited size. You will be getting http requests for a file. Http is a protocol for transporting data between computers. You do not need to understand the underlying components of this protocol beyond the high-level abstraction which Go gives you. The request will contain the path of the file which the requester wants data from. You will need to look for the file in your cache, or on disk if it is not in the cache. Once you have the data of the file, you will need to respond to the requester by ‘writing’ the data as the response. If you had to fetch the file from disk, you will also insert the data into your cache so that subsequent reads do not go to disk for the file.

We mentioned earlier the different vulnerabilities which file servers have. We will be testing for some basic sanitization of the requested filename string. You will replace some known sequences of characters which can traverse directories in an attempt to prevent the issue. This can be hard, as you have to make sure your replacement does not cause a different directory traversal string to appear!

One important caveat to your task is that you are not allowed to use locks! Instead you will be exploring how you can use GoLang’s channels to make it so that there is only ever one thread modifying the cache at a time.
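To make the idea concrete, here is a minimal sketch of the general pattern, not of the cache itself: one goroutine owns a piece of state, and every other goroutine reaches it only by sending a message on a channel. The names request, owner, and add are made up for this illustration and do not appear in the skeleton code.

```go
package main

import "fmt"

// request asks the owner goroutine to add delta to the counter and to
// report the new total on reply. (Hypothetical type for this sketch.)
type request struct {
	delta int
	reply chan int
}

// owner is the only goroutine that ever touches total, so no lock is needed.
func owner(reqs <-chan request) {
	total := 0
	for r := range reqs {
		total += r.delta
		r.reply <- total
	}
}

// add sends one request and waits for the owner's answer.
func add(reqs chan<- request, delta int) int {
	reply := make(chan int)
	reqs <- request{delta: delta, reply: reply}
	return <-reply
}

func main() {
	reqs := make(chan request)
	go owner(reqs)

	// 100 goroutines increment the counter concurrently.
	done := make(chan bool)
	for i := 0; i < 100; i++ {
		go func() {
			add(reqs, 1)
			done <- true
		}()
	}
	for i := 0; i < 100; i++ {
		<-done
	}
	fmt.Println(add(reqs, 0)) // prints 100
}
```

Because every update passes through the single owner goroutine, the updates are applied one at a time even though the senders run concurrently.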

You are not allowed to use any additional imports! Also do not change the current imports or you will get a 0!

At this point, you should take a look at the skeleton code to get an idea of the control flow of your server. Take a look at the userlib as well. The only functions from it which you should use to write the server are ReadFile and GetContentType. You may find some of its other functions useful when you write tests for your project.

Part 1: The default handler

The role of our handler function is to handle requests made to our server by parsing the path to the file and responding with the data of that file. The function gets called whenever there is a web request to any url that is not /cache/ or /cache/clear. This is because the main function launches a listener for connections, which relays each request it gets to the corresponding handler. Currently, the handler function always requests the url passed into it from disk. The function takes in a http.ResponseWriter and a http.Request. The http.Request contains some metadata about the http request plus the path to the file we are looking for. For this project, we only care about the URL path. The http.ResponseWriter is a structure which abstracts away the http response you will be making. There are a few things you should know about the two arguments:

http.Request

  • The path of the file is r.URL.Path. This is all you should need the http.Request for.

http.ResponseWriter

  • w.Header(): Returns the Header object which is used for the HTTP response. We only need to call it to set the Content Type. An example of this is given in the skeleton code.
  • w.WriteHeader(int statusCode): Takes in an int to send back as the response status of a request. An example of this is when you get a 404 error when you make a request to a file which does not exist.
  • w.Write([]byte): This takes in a byte slice and appends it to the data response. It will append it to the current data so multiple calls to w.Write will just have the data written multiple times.

You will be modifying the handler function to make requests from the cache. If the file is not stored in the cache, you will request the file from disk using userlib.ReadFile. You will need to cache the response if it succeeds and respond to the user. If you get a response which contains an error, you need to ensure that you reply back with the correct error. You may want to use getFile here instead to assist you in handling these requests.

There are two types of errors you have in this project. You can get a Timeout Error where the disk took too long to respond. You can also get a File Error where there was some error returned when you tried to read the file on disk.

You should use the function http.Error if you get back an error for the file request. It takes in a http.ResponseWriter, a description string, and a return code. Here is what you should respond with if you get an error:

  • If you receive a timeout error, you should pass into the http.Error function the message userlib.TimeoutString with the return code as userlib.TIMEOUTERRORCODE.
  • If the error is not a timeout, make sure you JUST pass into the http.Error function the string userlib.FILEERRORMSG with the return code userlib.FILEERRORCODE.

If there is NO error, make sure you call WriteHeader on the http.ResponseWriter with userlib.SUCCESSCODE. The skeleton code already does this for you. You will also need to make sure you select the correct content type depending on the file extension. This is another thing which the skeleton code does for you. You may need to modify that if you end up with a different filename. We have provided a helper function userlib.GetContentType which takes in a string and returns the Content Type based on the file extension.

Make sure you do all of the response handling specified above or you will not pass any of the tests due to how we are testing your code! Finally, you should make sure to write back the response which you got!

Part 2: Filename sanitization

For this part, we will just be trying to mitigate the directory traversal attack, since we do not want an attacker to steal all of the sensitive data on your brand new file server! For example, say you were working for an insurance company and were tasked with creating a file server which your clients could use to view some information about the company. If you also stored client personal data in a database located in a separate directory on the same computer, an attacker could send a request to your website that reaches the database and access all of the passwords, social security numbers, etc. which it contains! In practice:

Don’t roll your own security!

Nick Weaver

This is a common quote from Nick Weaver, since it is easy to make mistakes which can lead to you leaking information! Instead, you should use libraries which have been specifically built to defend against attacks. The reason we are having you do this anyway is so that you get some practice with one method you can use to mitigate this kind of attack.

When you get a file path from the request, it will contain a leading /. The problem is we want the request to be relative but the request is being made to the root of your filesystem! To make a path relative, you need to make the request start with ./. This means that if you get a request to /secrets/examfile.pdf, you will turn the file path to this: ./secrets/examfile.pdf

In addition to this, if you ever see /../, \/, or //, you should replace this with just a single / so that we can prevent an attacker from going up a directory or accessing the root directory. The reason why this works is because we are removing all of the methods of traversing towards the root of our filesystem. This allows us to make sure all of our requests will stay in the directory which we specified hosted our files.
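The replacement rule above could be sketched roughly as follows. This is one possible approach under stated assumptions, not the required implementation: the helper name sanitizePath is made up, and the loop keeps replacing until nothing changes because one replacement can create another bad sequence. It uses strings.Replace with -1 since strings.ReplaceAll does not exist in older Go releases.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizePath is a hypothetical helper. It replaces the sequences named
// in the spec with a single "/" until none remain, then makes the path
// relative by putting "." in front of the leading "/".
func sanitizePath(p string) string {
	bad := []string{"/../", "\\/", "//"}
	for changed := true; changed; {
		changed = false
		for _, seq := range bad {
			if strings.Contains(p, seq) {
				// A replacement can create a new bad sequence,
				// so remember to take another pass.
				p = strings.Replace(p, seq, "/", -1)
				changed = true
			}
		}
	}
	return "." + p
}

func main() {
	fmt.Println(sanitizePath("/secrets/examfile.pdf")) // prints ./secrets/examfile.pdf
	fmt.Println(sanitizePath("/../passwords.txt"))     // prints ./passwords.txt
	fmt.Println(sanitizePath("/..\\/passwords.txt"))   // prints ./passwords.txt
}
```

The third call shows why a single pass is not enough: replacing \/ turns /..\/ into /../, which only the next pass removes.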

We want our file server to be easy to use from a web browser like Chrome or Firefox, without the user having to type in a really complex url to access a file. To help the user access files, we will add support for searching for an index.html file. If we ever get a request to a directory, we should return the index.html which may or may not be located in that directory. For example, if I made a request to https://cs61c.org/, I would really be requesting the file named https://cs61c.org/index.html. Or say I was looking for https://cs61c.org/secret/; the request would look for https://cs61c.org/secret/index.html. In short, you should make sure you look for the index file if your path ends with /. Note that if I were to access https://cs61c.org/secret, I would really just be looking for a file named ./secret and not the index file in a directory named secret. To do the latter, you would have to make a request to https://cs61c.org/secret/.
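The trailing-slash rule is small enough to show in a few lines. The helper name withIndex is invented for this sketch, and it assumes the path has already been sanitized.

```go
package main

import (
	"fmt"
	"strings"
)

// withIndex is a hypothetical helper: a path that ends with "/" names a
// directory, so the file to serve is that directory's index.html.
func withIndex(p string) string {
	if strings.HasSuffix(p, "/") {
		return p + "index.html"
	}
	return p
}

func main() {
	fmt.Println(withIndex("./"))        // prints ./index.html
	fmt.Println(withIndex("./secret/")) // prints ./secret/index.html
	fmt.Println(withIndex("./secret"))  // prints ./secret
}
```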

Part 3: The Cache

This part should go in the function operateCache. The general idea is that your cache will get a file request and then have to return the data of that file if it exists in the cache. If the data does not exist in the cache, you should asynchronously fetch the file using userlib.ReadFile so that you can still handle requests for other files which may be in the cache. The main reason why we are doing this is because disk accesses are really slow. We have emulated a really slow disk in the userlib by taking some time to service requests to files. Once you implement basic caching, you should see your files take some time to load the first time but then load really quickly on later accesses.

Here are additional criteria for the cache:

  • Your cache should perform correctly for an arbitrary number of requests at the same time.
  • You will want to use the fileChan channel to receive requests in your cache. Take a look at the structure as it holds a channel which you will use as the response to a request.
  • You must never allow the cache to go over the specified capacity. (Hint: Consider what should happen if the file is larger than the cache’s capacity!)
  • If your cache will exceed its capacity when the new file is inserted in it, you can use any eviction method you want so long as the new file is NOT chosen by the eviction policy. You may need to evict multiple files. You should not just clear the cache for any eviction as your solution may be too slow to pass some of the tests.
  • You may only evict files if a new file is being added to the cache and it will exceed the capacity.
  • The size of the file is only based on the size of the data. This means the size does NOT include the cache entry structure or the filename size. You can use the len function on a slice to get the size.
  • If a file read responds with an error, you must not cache the error.
  • If you have two requests come in where the first has to fetch the file from disk while the second has to get the file from the cache, you must make sure that the disk request does not block the cache request. You should make sure you start processing the requests in the order which they come in.
  • You can have concurrent disk reads.
  • If a read takes longer than the time specified by timeout, you must return a timeout error right after the timeout time has passed. If you then receive the file back after returning a timeout, you must insert it in the cache. This means that if a thread that timed out makes another request to the same file after the ReadFile has returned, and the file has not been evicted yet, it should get the item from the cache and not perform another read.
  • If you receive a request on the cacheCloseChan channel, you must clear your cache, clear any global variable your cache may have set up, and return from the operateCache function.
  • If you receive a request on the cacheCapacityChan channel, use the string template given to you in the userlib (userlib.CapacityString) to return the number of entries in your cache, the current size of your cache, and the max capacity allowed in your cache. Note that it is in this order. You may find the fmt.Sprintf function helpful.

We are leaving the exact method of doing this up to you! We have added some comments in the code which will help guide you, but you can choose whether or not you follow those comments. Just ensure that your implementation follows the expected results specified in this spec so that the autograder is able to grade your submission.
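As one illustration of the capacity rules, here is a sketch of a size-bounded store that evicts the oldest entries first. Everything in it is an assumption made for the example: the store type, its fields, and the first-in-first-out policy are not taken from the skeleton, and any policy that never evicts the new file would do. The store is meant to be owned by a single goroutine, so it contains no locking.

```go
package main

import "fmt"

// store is a hypothetical size-bounded cache.
type store struct {
	capacity int
	size     int               // sum of len(data) over all entries
	files    map[string][]byte // filename -> data
	order    []string          // insertion order, oldest first
}

func newStore(capacity int) *store {
	return &store{capacity: capacity, files: make(map[string][]byte)}
}

// insert adds a file, evicting the oldest entries until it fits. A file
// larger than the whole capacity is not stored at all. It reports
// whether the file is in the store afterwards.
func (s *store) insert(name string, data []byte) bool {
	if len(data) > s.capacity {
		return false
	}
	if _, ok := s.files[name]; ok {
		return true // already cached, nothing to do
	}
	for s.size+len(data) > s.capacity {
		oldest := s.order[0]
		s.order = s.order[1:]
		s.size -= len(s.files[oldest])
		delete(s.files, oldest)
	}
	s.files[name] = data
	s.order = append(s.order, name)
	s.size += len(data)
	return true
}

func main() {
	s := newStore(10)
	s.insert("a", []byte("12345"))
	s.insert("b", []byte("12345"))
	s.insert("c", []byte("123")) // evicts "a" only
	// Entries, current size, capacity.
	fmt.Println(len(s.files), s.size, s.capacity) // prints 2 8 10
}
```

Only as many files as necessary are evicted, and the size counts file data only, not filenames or bookkeeping.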

Some other notes

  • Do not change the names of the following things: handler, cacheHandler, cacheClearHandler, capacity, workingDir, timeout, operateCache; if you do, the autograder will not be able to grade your submission.
  • Use the strings and numbers specified in the userlib (you should be calling those variables) or else you may not get credit for your submission.
  • You may not add imports! Notice that the _ flag in front of some of the current imports is there so that you are able to compile without using those imports. You may remove that flag if you end up using the import.
  • All of your work should go in the server.go file. We will not be grading your server_test.go file. You may not make any modifications to the userlib, as we will be using our own custom userlib for testing.
  • The Content Type can be set to anything if you are returning an error (as it does not get checked by the autograder).
  • You can print to STDOUT, as the autograder just relies on the go test framework.
  • Although there is no explicit timeout on the handler functions (as the timeout given is JUST for disk reads), they must be efficient and complete in a reasonable amount of time.
  • You should make sure the handler function does not return before the file is cached (or in the process of being cached).

Tips

  • Although you can code GoLang in any IDE or text editor, I would recommend that you use GoLand, since it will make it much easier for you to see errors in your code. Since you are all students, you can get it for free, so it may be helpful, especially if you like IntelliJ.

How to run your project

Once you have cloned your files and run go get github.com/61c-teach/sp19-proj5-userlib, cd into your project’s main directory.

From there you can run the command go run server.go

This will run your server with the default parameters set.

Here are some of the parameters you can set:

  • -p - This specifies which port you want the server to use. The default is 8080.
  • -c - This specifies the size of your cache in bytes. Default is 100000.
  • -d - This specifies which directory you want your server to run in. Default is public_html
  • -t - This specifies the timeout in seconds. The default is 2 seconds.

For example, say I wanted to run on port 8080 with 100 bytes of the cache and a 3 second timeout in the public_html directory, I would run the following command:

go run server.go -c 100 -t 3

After you have launched your server, you should see the following message:

Server starting, port: 8080, cache size: 100, timout: 3, working dir: 'public_html/'

You should then be able to use your browser to access the website at: http://localhost:8080

Note that the localhost url means that you are trying to access your local machine. This means that if you have ssh’ed into hive and want to use your computers browser, you will have to use the url of the hive machine. If you are working on one of the computers in the lab rooms and launch it on that machine, you should not have this issue.

For example, if you are ssh’ed into hive1 with the default port, you will have to use the following url to access your server: http://hive1.cs.berkeley.edu:8080.

If you get the following message after you run your project:

$ go run server.go
Server starting, port 8080, working dir: 'public_html/'
2019/04/14 18:30:16 listen tcp :8080: bind: address already in use
exit status 1

You will have to select a different port since that port is in use, so the site could not bind to that port.

If you get the following message:

$ go run server.go -p 10
Server starting, port 10, working dir: 'public_html/'
2019/04/14 18:32:45 listen tcp :10: bind: permission denied
exit status 1

You are not allowed to bind to a port lower than 1024 without root permissions. Choose a port higher than that! You can use any port up to and including 65535. Any port number above that is invalid (for more information, take networking!).

You may also find that your project gets timeout errors or is slow to serve files. This is because we have added a sleep to file reads to simulate a really slow disk. You may get a lot of timeouts from your server while a file is waiting to be cached. Once the file is cached, though, it should be fast as long as it does not get evicted from the cache!

Testing

We have provided an example test case which is located in the server_test.go file. Although we are not requiring you to write tests, if you have a question about why you are failing a test, please try your hardest to reproduce your issue locally first and then debug from there. This can help you better understand what your issue may be, and it will also give any TA you ask for help a better idea of what the issue is.

We will not release any of the autograder tests!

To run your tests:

  1) cd into your project’s directory.
  2) Run go test. This will run all the tests in files named “XXX_test.go”.

If you want to run a specific test, you can run the following command:

go test -run TestName

Where TestName is the name of the test you want to run.

Writing tests in Go is luckily pretty simple; however, you must adhere to certain naming conventions in order to get your tests to run correctly. To write your own tests, create a file with the name [FILE YOU WANT TO TEST]_test.go in the same directory as server.go. Here, you can simply augment the provided server_test.go file to add your own tests. Tests are just functions whose names start with the case-sensitive word Test followed by an uppercase word or phrase that describes what is being tested, e.g. TestAudioFile.

Inside your test file, make sure the package testing is imported. Then, when writing test functions, let t *testing.T be the only parameter passed in to the function. Write whatever logic you need to simulate a real-world file caching scenario, and then compare your expected results to the ones you got. You can call t.Error(string) to indicate that something went wrong in the course of running the test. Likewise, you should call t.Fatal(string) if the scenario went so poorly that the test cannot proceed. t.Log(string) provides nonfatal debugging information. You can append an “f” to any of these testing functions to print out a formatted string, e.g. t.Logf("Expected %d, got %d", expected, actual).

Since we cannot open a web browser to make HTTP requests when writing automated tests, we have to instead simulate doing so in code. In server_test.go, you can see that we abstract away receiving a response from the HTTP handler with the ResponseWriterTester struct, which contains fields for data, the HTTP status code, and the HTTP header that we get back from the file server after making a request. Checking these after making a call to the server’s handler function can ensure that we get the right results. Also, notice that the ReplaceReadFile function in the userlib package lets us replace the default fileReader function, which normally reads from disk. This might be useful for writing tests.

Submission

You will submit just the server.go file to Gradescope, or you can submit your GitHub repo to Gradescope instead. The autograder will run and only show you whether you pass the sanity tests. I have added some helpful comments on how you can reproduce your issue if you fail one of the sanity tests. Please make sure you write a test which does what the failing test says it does to figure out what your issue is.

Notice: The autograder uses this version of GoLang so make sure you only use functions which are implemented in this version!

$ go version
go version go1.10.4 linux/amd64

The autograder will not run any of the tests if your code does not compile or you add additional imports!

FAQ

go get URL_TO_LIBRARY seems to not do anything! Am I doing something wrong?

Don’t worry, go get does not print anything on success! You can check that it worked by going to the directory with cd $GOPATH/src/URL_TO_LIBRARY to see if it exists.

mkdir -p $GOPATH/src/github.com/61c-student or some_command $GOPATH/SOME_PATH is giving a permission denied error!

This probably means that you do not have the GOPATH set. For example, the mkdir command is probably trying to make the path /src/github.com/61c-student which is out of your permissions. Please read the GO section in background to get it set up.

Can we import the errors package? There seems to be no other way!

Take a look at the fmt library! You should be able to read docs to find what you need but I promise you do not need the errors package.

When I try to clone the repo (git clone https://github.com/61c-student/sp19-proj5-YOUR_GITHUB_USERNAME.git), it says the repo does not exist.

Make sure you have accepted the GitHub Classroom Assignment!

Do I have to work on the server.go file in the $GOPATH directory? It would be so much easier for me to put it in the ~/project5 directory.

You know… you are absolutely right! You can work on the server in whatever directory you want to and it should not be an issue. The only things which really need to go in the $GOPATH are libraries. It is good practice to work in there if you are building a library but for this project, you do not have to do that if you do not want. So long as the userlib is in the GOPATH in the correct path in it, you should be fine to work on the server.go file from wherever.

(Last updated: 4/20/19 9:30PM)