COMP30023 Project 2 replacement Web proxy

Published: 2024-06-28



Weight: 15% of the final mark

1 Project Overview

The aim of this project is to familiarize you with socket programming. Your task is to write a caching web proxy for HTTP/1.1.

Your code must be written in C or Rust. Submissions that do not compile and run on a cloud VM may receive zero marks.

A web proxy is a process that runs on an internet host and receives web requests for URLs hosted on other hosts. It either serves these requests from a cache, or forwards the requests to the actual hosts.

There are many reasons for using proxies.

One reason is to cache content. Web browsers cache content locally, but if multiple computers try to download the same content, they cannot get it from another browser’s cache. If they all use the same nearby proxy, then the proxy can download the content once, and individual computers can download copies from there.

Caching is difficult with HTTPS, which often simply relies on proxies to forward encrypted data without being able to look at the headers. (For the reasons for forcing HTTPS, see https://www.troyhunt.com/heres-why-your-static-website-needs-https and the accompanying video https://www.youtube.com/watch?v=gZ1mM6OtXIc.)

Some public sites that do not force an upgrade to HTTPS, which you can use for testing from your personal machine, are

• http://www.washington.edu

• http://yimg.com

• http://icio.us

• http://rs6.net

• http://www.faqs.org/faqs

• http://www.wikidot.com

• http://www.videolan.org

• http://www.openoffice.org

Another reason is security. Virtual machines in the Melbourne Research Cloud have “private” IP addresses, and cannot make TCP connections to hosts on the global internet. However, they can download web resources because they are configured to use a proxy. The proxy has two IP addresses: one in the “private” address space accessible by the VMs and another in the global address space, which can reach the web servers.

2 Project Details

Your task is to design and code a simple caching web proxy.

2.1 Stage 1: Simple proxy

The first stage is simply to proxy all requests, without caching.

This stage will create a listening TCP socket on port 8000 of the local host (127.0.0.1 or ::1). For any GET request it receives on that socket, it should identify the host (the “origin server”), create a TCP connection to that host, and send the request to it. It will then read the complete response and send it back to the host that sent the GET request.

Whenever a GET message is sent to the origin server, the program should log a line GETting url to stdout.
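Two pieces of Stage 1 can be sketched as follows: opening the listening socket, and finding the origin server in a request. This is a minimal sketch, not a reference solution. The names open_listener and extract_host are hypothetical, the listener binds only the IPv4 loopback address (not ::1), and the request is assumed to be a complete, NUL-terminated header block with CRLF line endings.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>    /* strncasecmp (POSIX) */
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a listening TCP socket on 127.0.0.1:port.
 * Returns the descriptor, or -1 on failure. IPv4 only in this sketch. */
int open_listener(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int yes = 1;  /* allow quick restarts while testing */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Hypothetical helper: copy the value of the Host header into `host`.
 * Returns 0 on success, -1 if the header is missing or does not fit. */
int extract_host(const char *request, char *host, size_t host_len)
{
    const char *line = strstr(request, "\r\n");   /* end of the request line */
    while (line != NULL) {
        line += 2;                                /* start of the next line */
        if (*line == '\0' || strncmp(line, "\r\n", 2) == 0)
            break;                                /* blank line ends headers */
        if (strncasecmp(line, "Host:", 5) == 0) { /* header names ignore case */
            const char *value = line + 5;
            value += strspn(value, " \t");        /* skip optional whitespace */
            size_t n = strcspn(value, "\r\n");
            while (n > 0 && (value[n - 1] == ' ' || value[n - 1] == '\t'))
                n--;                              /* trim trailing whitespace */
            if (n == 0 || n >= host_len)
                return -1;
            memcpy(host, value, n);
            host[n] = '\0';
            return 0;
        }
        line = strstr(line, "\r\n");
    }
    return -1;
}
```

A caller would typically pass 8000 to open_listener, accept() a connection, read the request headers, and then use the extracted host name (for example with getaddrinfo) to connect to the origin server.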

2.2 Stage 2: Naive caching

The second stage is to keep a copy of all requests and their responses. Allocate a cache with 10 entries, each 100kB in size.

For every GET request received, if the URL length is less than 1000 bytes, look in the cache to see if you have received this request before. If you have, then reply with the response that you received last time.

If you do not have the entry in the cache, evict the least recently used (LRU) element of the cache, fetch the response from the actual host, and if the URL is less than 1000 bytes, place it in the cache.

Whenever a response is sent from the cache, the program should log a line serving url from cache to stdout, instead of logging the GET command.
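The cache described above could be laid out as a fixed array with a logical clock for LRU ordering. The following is a sketch under stated assumptions: "100kB" is taken to mean 100 * 1024 bytes, the struct and function names are hypothetical, the cache must start zero-initialised (for example as a static variable), and eviction is folded into the store step rather than done before the fetch.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_ENTRIES 10
#define MAX_RESPONSE  (100 * 1024)  /* assumption: 100 kB read as 100 KiB */
#define MAX_URL       1000          /* URLs must be shorter than this */

struct cache_entry {
    unsigned long last_used;        /* 0 means the slot is free */
    size_t length;                  /* bytes stored in response[] */
    char url[MAX_URL];
    char response[MAX_RESPONSE];
};

struct cache {
    unsigned long clock;            /* incremented on every access */
    struct cache_entry entries[CACHE_ENTRIES];
};

/* Return the entry for `url`, or NULL on a miss.
 * A hit refreshes the entry's LRU timestamp. */
struct cache_entry *cache_lookup(struct cache *c, const char *url)
{
    if (strlen(url) >= MAX_URL)
        return NULL;
    for (int i = 0; i < CACHE_ENTRIES; i++) {
        struct cache_entry *e = &c->entries[i];
        if (e->last_used != 0 && strcmp(e->url, url) == 0) {
            e->last_used = ++c->clock;
            return e;
        }
    }
    return NULL;
}

/* Store a response, replacing an entry with the same URL if present,
 * otherwise the free or least recently used slot.
 * Returns 0 if stored, -1 if the URL or response is too large. */
int cache_store(struct cache *c, const char *url, const char *data, size_t len)
{
    if (strlen(url) >= MAX_URL || len > MAX_RESPONSE)
        return -1;

    struct cache_entry *victim = &c->entries[0];
    for (int i = 0; i < CACHE_ENTRIES; i++) {
        struct cache_entry *e = &c->entries[i];
        if (e->last_used != 0 && strcmp(e->url, url) == 0) {
            victim = e;             /* same URL: overwrite in place */
            break;
        }
        if (e->last_used < victim->last_used)
            victim = e;             /* free slots (0) sort before used ones */
    }

    strcpy(victim->url, url);       /* safe: length checked above */
    memcpy(victim->response, data, len);
    victim->length = len;
    victim->last_used = ++c->clock;
    return 0;
}
```

Because each entry embeds its 100 kB buffer, the whole structure is about 1 MB, so it is better declared static or allocated on the heap than placed on the stack.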

2.3 Stage 3: Valid caching

Not all responses can be cached.

For this stage, only responses to commands that can be cached (like GET and HEAD) should be cached. Also, Cache-Control headers in responses should be respected. If a Cache-Control header contains private, no-store, no-cache, max-age=0, must-revalidate or proxy-revalidate then do not cache that response.

Whenever an item is fetched but not put in the cache, the program should log a line Not caching url to stdout, after logging the GET command.
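One way to apply the Cache-Control rule is to split the header value on commas and compare each directive, ignoring case, against the list given above. This is only a sketch: the function name is hypothetical, it assumes the header value has already been extracted from the response, and it matches directives exactly (so an argument form such as no-cache="field" would not be caught).

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>    /* strncasecmp (POSIX) */

/* Directives that, per the stage description, forbid caching. */
static const char *const NO_CACHE_DIRECTIVES[] = {
    "private", "no-store", "no-cache", "max-age=0",
    "must-revalidate", "proxy-revalidate",
};

/* Return 1 if the Cache-Control value permits caching, 0 otherwise.
 * `value` is the text after "Cache-Control:", without the CRLF. */
int cache_control_allows_caching(const char *value)
{
    const size_t count =
        sizeof NO_CACHE_DIRECTIVES / sizeof NO_CACHE_DIRECTIVES[0];
    const char *p = value;

    for (;;) {
        p += strspn(p, " \t,");             /* skip separators */
        if (*p == '\0')
            return 1;                       /* no forbidding directive seen */

        size_t len = strcspn(p, ",");       /* directive runs to next comma */
        size_t trimmed = len;
        while (trimmed > 0 &&
               (p[trimmed - 1] == ' ' || p[trimmed - 1] == '\t'))
            trimmed--;                      /* drop trailing whitespace */

        for (size_t i = 0; i < count; i++) {
            const char *d = NO_CACHE_DIRECTIVES[i];
            if (strlen(d) == trimmed && strncasecmp(p, d, trimmed) == 0)
                return 0;
        }
        p += len;
    }
}
```

Tokenising rather than using a plain substring search avoids false matches, for example treating max-age=05 as max-age=0.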

2.4 Stage 4: Expiration

The Cache-Control header can specify a max-age=xxx field. This specifies how many seconds the response is valid for. If it is cached, then the cache entry should become “stale” after that time.

For this stage, the code should not respond with stale cache entries.

Whenever a stale cache entry matches, the program should log a line Stale entry for url to stdout, before logging the GET command.
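The expiry check can be sketched as two small helpers: one that reads max-age from a Cache-Control value, and one that compares the entry's age with it. The function names are hypothetical, and treating a response with no max-age as never stale is an assumption of this sketch, not something the stage description states.

```c
#include <stdlib.h>
#include <string.h>
#include <strings.h>    /* strncasecmp (POSIX) */
#include <time.h>

/* Return the max-age value in seconds from a Cache-Control value,
 * or -1 if the directive is absent or malformed. */
long parse_max_age(const char *value)
{
    const char *p = value;
    while (*p != '\0') {
        p += strspn(p, " \t,");                 /* skip separators */
        if (strncasecmp(p, "max-age=", 8) == 0) {
            const char *digits = p + 8;
            if (*digits < '0' || *digits > '9')
                return -1;                      /* e.g. quoted or empty value */
            return strtol(digits, NULL, 10);
        }
        p += strcspn(p, ",");                   /* move to the next directive */
    }
    return -1;
}

/* Return 1 if an entry stored at `stored_at` with lifetime `max_age`
 * seconds is stale at time `now`. A negative max_age (no directive)
 * is treated as never stale: an assumption of this sketch. */
int entry_is_stale(time_t stored_at, long max_age, time_t now)
{
    if (max_age < 0)
        return 0;
    return difftime(now, stored_at) >= (double)max_age;
}
```

In use, the proxy would record time(NULL) and the parsed max-age when it stores a response, then call entry_is_stale with the current time on each cache hit.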

2.5 Stage 5: Checking for updates

If the cache entry is stale, it is still not always necessary to download the document again. Instead, it is possible to use the If-Modified-Since header or If-None-Match header to request that the page be downloaded only if it is newer than the cached version.

This allows the server to reply with status code 304 (Not Modified) if the cached entry is still valid. The max-age setting in the response can be used to set a new expiration time for the cache entry.

This information can also be obtained using a HEAD request, although that is less efficient as it requires two HTTP requests.

Whenever a stale cache entry is refreshed this way, without being downloaded again, the program should log a line Entry for url refreshed to stdout, after logging the presence of the stale entry and after logging the GET request.
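A revalidation request of this kind could be built as below. This is a sketch with hypothetical function names; it assumes the proxy saved the ETag and Last-Modified values from the original response (either may be NULL if the server did not send one) and that the response passed to the second helper starts with its status line.

```c
#include <stddef.h>
#include <stdio.h>

/* Write a conditional GET into `buf`. `etag` and `last_modified` are the
 * validator values saved from the cached response; either may be NULL.
 * Returns the request length, or -1 if `buf` is too small. */
int build_conditional_get(char *buf, size_t buf_len,
                          const char *host, const char *path,
                          const char *etag, const char *last_modified)
{
    int n = snprintf(buf, buf_len,
                     "GET %s HTTP/1.1\r\n"
                     "Host: %s\r\n"
                     "%s%s%s"       /* optional If-None-Match line */
                     "%s%s%s"       /* optional If-Modified-Since line */
                     "\r\n",
                     path, host,
                     etag ? "If-None-Match: " : "",
                     etag ? etag : "",
                     etag ? "\r\n" : "",
                     last_modified ? "If-Modified-Since: " : "",
                     last_modified ? last_modified : "",
                     last_modified ? "\r\n" : "");
    if (n < 0 || (size_t)n >= buf_len)
        return -1;                  /* output error or truncated */
    return n;
}

/* Return 1 if the response's status line carries status code 304. */
int response_is_not_modified(const char *response)
{
    int major, minor, status;
    if (sscanf(response, "HTTP/%d.%d %d", &major, &minor, &status) != 3)
        return 0;
    return status == 304;
}
```

On a 304 reply the proxy can keep the cached body and only update the entry's stored time and max-age; on any other status it handles the reply as a fresh download.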

2.6 Stretch goal: Concurrent requests

It can take quite a while for the origin server to send a reply. During this time, another user of the proxy may make a request.

A stretch goal is to make your proxy able to serve multiple requests concurrently, rather than one at a time. This can be achieved either by multithreading or using services like select() to read from multiple sockets.
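For the select() route, the core step is asking the kernel which of several sockets have data waiting. The helper below is a sketch of that step only, not a full event loop; its name and interface are hypothetical, and every descriptor is assumed to be smaller than FD_SETSIZE.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Wait up to `timeout_ms` milliseconds for any descriptor in fds[] to
 * become readable. Sets ready[i] to 1 for each readable descriptor and
 * 0 otherwise. Returns select()'s result: the number of ready
 * descriptors, 0 on timeout, or -1 on error. */
int wait_readable(const int *fds, int count, int *ready, int timeout_ms)
{
    fd_set set;
    FD_ZERO(&set);

    int max_fd = -1;
    for (int i = 0; i < count; i++) {
        FD_SET(fds[i], &set);       /* each fd must be below FD_SETSIZE */
        if (fds[i] > max_fd)
            max_fd = fds[i];
    }

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* select() rewrites `set` so that only readable descriptors remain. */
    int n = select(max_fd + 1, &set, NULL, NULL, &tv);

    for (int i = 0; i < count; i++)
        ready[i] = (n > 0 && FD_ISSET(fds[i], &set)) ? 1 : 0;
    return n;
}
```

A concurrent proxy would include the listening socket in fds[] alongside client and origin-server sockets: when the listener is readable it calls accept(), and when any other socket is readable it reads whatever is available and continues that request's state.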

The marks allocated to this stretch goal are deliberately not worth the effort it will take. It should only be attempted by those realistically hoping to get 15/15. If the project seems too big, then do not attempt this stretch goal.

Plagiarism policy: You are reminded that all submitted project work in this subject is to be your own individual work. Automated similarity checking software will be used to compare submissions. It is University policy that cheating by students in any form is not permitted, and that work submitted for assessment purposes must be the independent work of the student concerned.

Using git properly is an important step in the verification of authorship. We should see the stages of your code being written, not just the finished product.

AI software such as ChatGPT can generate code, but it will not earn you marks. You are allowed to use tools like ChatGPT, but if you do then you must strictly adhere to the following rules.

1. Have a file called AI.txt

2. That file must state the query you gave to the AI, and the response it gave

3. You will only be marked on the differences between your final submission and the AI output.

If the AI has built you something that gains you points for Task 1, then you will not get points for Task 1; the AI will get all those points.

If the AI has built you something that gains no marks by itself, but you only need to modify five lines to get something that works, then you will get credit for identifying and modifying those five lines.

4. If you ask a generic question like “How do I convert an integer to network byte order?” or “What does the error ‘implicit declaration of function rpc_close_server’ mean?” then you will not lose any marks for using its answer, but please report it in your AI.txt file.